Understanding Deep Learning
Simon J.D. Prince
May 27, 2024
If you enjoy this book, here are four ways you can help me:
1. Spread the word via social media. Posts in languages other than English particularly
welcome. Tag me on LinkedIn or X and I'll probably say hi.
2. Write me an Amazon review. Preferably positive, but all publicity is good
publicity...
3. Send me comments (see bottom of this page). I reply to everything eventually.
4. Buy a copy. I took 18 months completely off work to write this book and ideally
I’d like to make minimum wage (or better) for this time. Also, I’d like to write
a second edition, but I need to sell enough copies to do this. Thanks!
The most recent version of this document can be found at http://udlbook.com.
Copyright in this work has been licensed exclusively to The MIT Press,
https://mitpress.mit.edu, which released the final version to the public in December 2023.
Inquiries regarding rights should be addressed to the MIT Press, Rights & Permissions
Department.
This work is subject to a Creative Commons CC-BY-NC-ND license.
I would really appreciate help improving this document. No detail too small! Please contact
me with suggestions, factual inaccuracies, ambiguities, questions, and errata via github or by
e-mail at udlbookmail@gmail.com.
This book is dedicated to Blair, Calvert, Coppola, Ellison, Faulkner, Kerpatenko, Morris,
Robinson, Sträussler, Wallace, Waymon, Wojnarowicz, and all the others whose work is
even more important and interesting than deep learning.
Contents
Preface ix
Acknowledgements xi
1 Introduction 1
1.1 Supervised learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Unsupervised learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 Reinforcement learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.4 Ethics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.5 Structure of book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.6 Other books . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.7 How to read this book . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2 Supervised learning 17
2.1 Supervised learning overview . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2 Linear regression example . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3 Shallow neural networks 25
3.1 Neural network example . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2 Universal approximation theorem . . . . . . . . . . . . . . . . . . . . . . 29
3.3 Multivariate inputs and outputs . . . . . . . . . . . . . . . . . . . . . . . 30
3.4 Shallow neural networks: general case . . . . . . . . . . . . . . . . . . . . 33
3.5 Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4 Deep neural networks 41
4.1 Composing neural networks . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2 From composing networks to deep networks . . . . . . . . . . . . . . . . 43
4.3 Deep neural networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.4 Matrix notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.5 Shallow vs. deep neural networks . . . . . . . . . . . . . . . . . . . . . . 49
4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
Draft: please send errata to udlbookmail@gmail.com.
5 Loss functions 56
5.1 Maximum likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.2 Recipe for constructing loss functions . . . . . . . . . . . . . . . . . . . . 60
5.3 Example 1: univariate regression . . . . . . . . . . . . . . . . . . . . . . 61
5.4 Example 2: binary classification . . . . . . . . . . . . . . . . . . . . . . 64
5.5 Example 3: multiclass classification . . . . . . . . . . . . . . . . . . . . 67
5.6 Multiple outputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.7 Cross-entropy loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
6 Fitting models 77
6.1 Gradient descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
6.2 Stochastic gradient descent . . . . . . . . . . . . . . . . . . . . . . . . . . 83
6.3 Momentum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
6.4 Adam . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.5 Training algorithm hyperparameters . . . . . . . . . . . . . . . . . . . . 91
6.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
7 Gradients and initialization 96
7.1 Problem denitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
7.2 Computing derivatives . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
7.3 Toy example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
7.4 Backpropagation algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 103
7.5 Parameter initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
7.6 Example training code . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
7.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
8 Measuring performance 118
8.1 Training a simple model . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
8.2 Sources of error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
8.3 Reducing error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
8.4 Double descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
8.5 Choosing hyperparameters . . . . . . . . . . . . . . . . . . . . . . . . . . 132
8.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
9 Regularization 138
9.1 Explicit regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
9.2 Implicit regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
9.3 Heuristics to improve performance . . . . . . . . . . . . . . . . . . . . . . 144
9.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
10 Convolutional networks 161
10.1 Invariance and equivariance . . . . . . . . . . . . . . . . . . . . . . . . . 161
10.2 Convolutional networks for 1D inputs . . . . . . . . . . . . . . . . . . . . 163
10.3 Convolutional networks for 2D inputs . . . . . . . . . . . . . . . . . . . . 170
10.4 Downsampling and upsampling . . . . . . . . . . . . . . . . . . . . . . . 171
10.5 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
10.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
11 Residual networks 186
11.1 Sequential processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
11.2 Residual connections and residual blocks . . . . . . . . . . . . . . . . . . 189
11.3 Exploding gradients in residual networks . . . . . . . . . . . . . . . . . . 192
11.4 Batch normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
11.5 Common residual architectures . . . . . . . . . . . . . . . . . . . . . . . 195
11.6 Why do nets with residual connections perform so well? . . . . . . . . . 199
11.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
12 Transformers 207
12.1 Processing text data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
12.2 Dot-product self-attention . . . . . . . . . . . . . . . . . . . . . . . . . . 208
12.3 Extensions to dot-product self-attention . . . . . . . . . . . . . . . . . . 213
12.4 Transformer layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
12.5 Transformers for natural language processing . . . . . . . . . . . . . . . . 216
12.6 Encoder model example: BERT . . . . . . . . . . . . . . . . . . . . . . . 219
12.7 Decoder model example: GPT3 . . . . . . . . . . . . . . . . . . . . . . . 222
12.8 Encoder-decoder model example: machine translation . . . . . . . . . . . 226
12.9 Transformers for long sequences . . . . . . . . . . . . . . . . . . . . . . . 227
12.10 Transformers for images . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
12.11 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
13 Graph neural networks 240
13.1 What is a graph? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
13.2 Graph representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
13.3 Graph neural networks, tasks, and loss functions . . . . . . . . . . . . . . 245
13.4 Graph convolutional networks . . . . . . . . . . . . . . . . . . . . . . . . 248
13.5 Example: graph classification . . . . . . . . . . . . . . . . . . . . . . . 251
13.6 Inductive vs. transductive models . . . . . . . . . . . . . . . . . . . . . . 252
13.7 Example: node classification . . . . . . . . . . . . . . . . . . . . . . . 253
13.8 Layers for graph convolutional networks . . . . . . . . . . . . . . . . . . 256
13.9 Edge graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260
13.10 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
14 Unsupervised learning 268
14.1 Taxonomy of unsupervised learning models . . . . . . . . . . . . . . . . . 268
14.2 What makes a good generative model? . . . . . . . . . . . . . . . . . . . 269
14.3 Quantifying performance . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
14.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
15 Generative Adversarial Networks 275
15.1 Discrimination as a signal . . . . . . . . . . . . . . . . . . . . . . . . . . 275
15.2 Improving stability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280
15.3 Progressive growing, minibatch discrimination, and truncation . . . . . . 286
15.4 Conditional generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 288
15.5 Image translation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
15.6 StyleGAN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
15.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
16 Normalizing ows 303
16.1 1D example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
16.2 General case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 306
16.3 Invertible network layers . . . . . . . . . . . . . . . . . . . . . . . . . . . 308
16.4 Multi-scale ows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316
16.5 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 317
16.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320
17 Variational autoencoders 326
17.1 Latent variable models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326
17.2 Nonlinear latent variable model . . . . . . . . . . . . . . . . . . . . . . . 327
17.3 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 330
17.4 ELBO properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333
17.5 Variational approximation . . . . . . . . . . . . . . . . . . . . . . . . . . 335
17.6 The variational autoencoder . . . . . . . . . . . . . . . . . . . . . . . . . 335
17.7 The reparameterization trick . . . . . . . . . . . . . . . . . . . . . . . . . 338
17.8 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339
17.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 342
18 Diusion models 348
18.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 348
18.2 Encoder (forward process) . . . . . . . . . . . . . . . . . . . . . . . . . . 349
18.3 Decoder model (reverse process) . . . . . . . . . . . . . . . . . . . . . . . 355
18.4 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 356
18.5 Reparameterization of loss function . . . . . . . . . . . . . . . . . . . . . 360
18.6 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362
18.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 367
19 Reinforcement learning 373
19.1 Markov decision processes, returns, and policies . . . . . . . . . . . . . . 373
19.2 Expected return . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 377
19.3 Tabular reinforcement learning . . . . . . . . . . . . . . . . . . . . . . . . 381
19.4 Fitted Q-learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 385
19.5 Policy gradient methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 388
19.6 Actor-critic methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 393
19.7 Oine reinforcement learning . . . . . . . . . . . . . . . . . . . . . . . . 394
19.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 395
20 Why does deep learning work? 401
20.1 The case against deep learning . . . . . . . . . . . . . . . . . . . . . . . . 401
20.2 Factors that inuence tting performance . . . . . . . . . . . . . . . . . 402
20.3 Properties of loss functions . . . . . . . . . . . . . . . . . . . . . . . . . . 406
20.4 Factors that determine generalization . . . . . . . . . . . . . . . . . . . . 410
20.5 Do we need so many parameters? . . . . . . . . . . . . . . . . . . . . . . 414
20.6 Do networks have to be deep? . . . . . . . . . . . . . . . . . . . . . . . . 417
20.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 418
21 Deep learning and ethics 420
21.1 Value alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 420
21.2 Intentional misuse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 426
21.3 Other social, ethical, and professional issues . . . . . . . . . . . . . . . . 428
21.4 Case study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 430
21.5 The value-free ideal of science . . . . . . . . . . . . . . . . . . . . . . . . 431
21.6 Responsible AI research as a collective action problem . . . . . . . . . . 432
21.7 Ways forward . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 433
21.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 434
A Notation 436
B Mathematics 439
B.1 Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 439
B.2 Binomial coecients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 441
B.3 Vectors, matrices, and tensors . . . . . . . . . . . . . . . . . . . . . . . 442
B.4 Special types of matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . 445
B.5 Matrix calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 447
C Probability 448
C.1 Random variables and probability distributions . . . . . . . . . . . . . . 448
C.2 Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 452
C.3 Normal probability distribution . . . . . . . . . . . . . . . . . . . . . . . 456
C.4 Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 459
C.5 Distances between probability distributions . . . . . . . . . . . . . . . . . 459
Bibliography 462
Index 513
Preface
The history of deep learning is unusual in science. The perseverance of a small cabal of
scientists, working over twenty-five years in a seemingly unpromising area, has revolutionized
a field and dramatically impacted society. Usually, when researchers investigate an
esoteric and apparently impractical corner of science or engineering, it remains just that:
esoteric and impractical. However, this was a notable exception. Despite widespread
skepticism, the systematic efforts of Yoshua Bengio, Geoffrey Hinton, Yann LeCun, and
others eventually paid off.
The title of this book is “Understanding Deep Learning” to distinguish it from vol-
umes that cover coding and other practical aspects. This text is primarily about the
ideas that underlie deep learning. The first part of the book introduces deep learning
models and discusses how to train them, measure their performance, and improve this
performance. The next part considers architectures that are specialized to images, text,
and graph data. These chapters require only introductory linear algebra, calculus, and
probability and should be accessible to any second-year undergraduate in a quantitative
discipline. Subsequent parts of the book tackle generative models and reinforcement
learning. These chapters require more knowledge of probability and calculus and target
more advanced students.
The title is also partly a joke: no one really understands deep learning at the time of
writing. Modern deep networks learn piecewise linear functions with more regions than
there are atoms in the universe and can be trained with fewer data examples than model
parameters. It is neither obvious that we should be able to fit these functions reliably
nor that they should generalize well to new data. The penultimate chapter addresses
these and other aspects that are not yet fully understood. Regardless, deep learning will
change the world for better or worse. The final chapter discusses AI ethics and concludes
with an appeal for practitioners to consider the moral implications of their work.
Your time is precious, and I have striven to curate and present the material so you
can understand it as efficiently as possible. The main body of each chapter comprises
a succinct description of only the most essential ideas, together with accompanying
illustrations. The appendices review all mathematical prerequisites, and there should be
no need to refer to external material. For readers wishing to delve deeper, each chapter
has associated problems, Python notebooks, and extensive background notes.
Writing a book is a lonely, grinding, multiple-year process and is only worthwhile if
the volume is widely adopted. If you enjoy reading this or have suggestions for improving
it, please contact me via the accompanying website. I would love to hear your thoughts,
which will inform and motivate subsequent editions.
Acknowledgments
Writing this book would not have been possible without the generous help and advice of these
individuals: Kathryn Hume, Kevin Murphy, Christopher Bishop, Peng Xu, Yann Dubois, Justin
Domke, Chris Fletcher, Yanshuai Cao, Wendy Tay, Corey Toler-Franklin, Dmytro Mishkin, Guy
McCusker, Daniel Worrall, Paul McIlroy, Roy Amoyal, Austin Anderson, Romero Barata de
Morais, Gabriel Harrison, Peter Ball, Alf Muir, David Bryson, Vedika Parulkar, Patryk Lietzau,
Jessica Nicholson, Alexa Huxley, Oisin Mac Aodha, Giuseppe Castiglione, Josh Akylbekov, Alex
Gougoulaki, Joshua Omilabu, Alister Guenther, Joe Goodier, Logan Wade, Joshua Guenther,
Kylan Tobin, Benedict Ellett, Jad Araj, Andrew Glennerster, Giorgos Sfikas, Diya Vibhakar,
Sam Mansat-Bhattacharyya, Ben Ross, Ivor Simpson, Gaurang Aggarwal, Shakeel Sheikh, Ja-
cob Horton, Felix Rammell, Sasha Luccioni, Akshil Patel, Alessandro Gentilini, Kevin Mercier,
Krzysztof Lichocki, Chuck Krapf, Brian Ha, Chris Kang, Leonardo Viotti, Kai Li, Himan
Abdollahpouri, Ari Pakman, Giuseppe Antonio Di Luna, Dan Oneață, Conrad Whiteley, Joseph
Santarcangelo, Brad Shook, Gabriel Brostow, Lei He, Ali Satvaty, Romain Sabathé, Qiang Zhou,
Prasanna Vigneswaran, Siqi Zheng, Stephan Grein, Jonas Klesen, Giovanni Stilo, Huang Bokai,
Kevin McGuinness, Qiang Sun, Zakaria Lotfi, Yifei Lin, Sylvain Bouix, Alex Pitt, Stephane
Chretien, Robin Liu, Bian Li, Adam Jones, Marcin Świerkot, Tommy Löfstedt, Eugen Ho-
taj, Fernando Flores-Mangas, Tony Polichroniadis, Pietro Monticone, Rohan Deepak Ajwani,
Menashe Yarden Einy, Robert Gevorgyan, Thilo Stadelmann, Gui JieMiao, Botao Zhu, Mo-
hamed Elabbas, Satya Krishna Gorti, James Elder, Helio Perroni Filho, Xiaochao Qu, Jaekang
Shin, Joshua Evans, Robert Dobson, Shibo Wang, Edoardo Zorzi, Stanisław Jastrzębski, Pieris
Kalligeros, Matt Hewitt, Zvika Haramaty, Ted Mavroidis, Nikolaj Kuntner, Amir Yorav, Ma-
soud Mokhtari, Xavier Gabaix, Marco Garosi, Vincent Schönbach, Avishek Mondal, Victor
S.C. Lui, Sumit Bhatia, Julian Asilis, Hengchao Chen, Siavash Khallaghi, Csaba Szepesvári,
Mike Singer, Mykhailo Shvets, Abdalla Ibrahim, Stefan Hell, Ron Raphaeli, Diogo Tavares,
Aristotelis Siozopoulos, Jianrui Wu, Jannik Münz, Penn Mackintosh, Shawn Hoareau, Qianang
Zhou, Emma Li, Charlie Groves, Xiang Lingxiao, Trivikram Muralidharan, Rajat Binaykiya,
Germán del Cacho Salvador, Alexey Bloudov, Paul Colognese, Bo Yang, Jani Monoses, Adenil-
son Arcanjo, Matan Golani, Emmanuel Onzon, Shenghui Yan, Kamesh Kompella, Julius Aka,
Johannes Brunnemann, Varniethan Ketheeswaran, Alex Ostrovsky, Daniel Burbank, Gavrie
Philipson, Roozbeh Ehsani, Len Spek, Christoph Brune, Mohammad Nosrati, Bian Li, Runqi
Chen, Qifu Hu, Rasmi Elasmar, Ronaldo Butrus, Carles Mesado, Jeffrey Wolberg, Olivier Koch,
Edoardo Lanari, Fanmin Shi, Neel Maniar, Maksym Taran, Falk Langhammer, Reinaldo Lep-
sch, Max Talberg, Vishal Jain, Christian Arnold, Charles Hill, Nikita Panin, Steven Dillmann,
Suhas Mathur, Harris Abdul Majid, Guolong Lin, Charles Elkan, Benedict Kuester, Vladimir
Ivanov, Mohammad-Hadi Sotoudeh, Daniel Enériz Orta, Ian Jeffrey, Kwok Chun, Yu Liu, Tom
Vettenburg, Aravinda Perera, Daniel Gigliotti, Iftikhar Ramnandan, Adnan Siddiquei, Will
Knottenbelt, Valerio Di Stefano, Srikant Jayaraman, Goldie Srulovich, Rafał Rolczyński, An-
thony Ip, and Andre Coelho.
I’m particularly grateful to Daniyar Turmukhambetov, Amedeo Buonanno, Andrea Panizza,
Mark Hudson, and Bernhard Pfahringer, who provided detailed comments on multiple chapters
of the book. I’d like to especially thank Andrew Fitzgibbon, Konstantinos Derpanis, and Tyler
Mills, who read the whole book and whose enthusiasm helped me complete this project. I’d
also like to thank Neill Campbell and Özgür Şimşek, who hosted me at the University of Bath,
where I taught a course based on this material for the first time. Finally, I'm extremely grateful
to my editor Elizabeth Swayze for her frank advice throughout this process.
Chapter 12 (transformers) and chapter 17 (variational autoencoders) were first published
as blogs for Borealis AI, and adapted versions are reproduced with permission of Royal Bank
of Canada along with Borealis AI. I am grateful for their support in this endeavor. Chapter 16
(normalizing ows) is loosely based on the review article by Kobyzev et al. (2020), on which
I was a co-author. I was very fortunate to be able to collaborate on Chapter 21 with Travis
LaCroix from Dalhousie University, who was both easy and fun to work with, and who did the
lion’s share of the work.
Attribution
Chessboard image in figure 1.13 adapted from http://tinyurl.com/yc2d54d4.
Cogs image in figures 1.2, 1.4, 1.10 adapted from http://tinyurl.com/2c7tttr8.
Penguin image in figures 19.1–19.5 and 19.6–19.9 adapted from http://tinyurl.com/ycx9je56.
Fish image in figures 19.2–19.5, 19.7, 19.10–19.12 adapted from http://tinyurl.com/4ueyhtsu.
Chapter 1
Introduction
Articial intelligence, or AI, is concerned with building systems that simulate intelligent
behavior. It encompasses a wide range of approaches, including those based on logic,
search, and probabilistic reasoning. Machine learning is a subset of AI that learns to
make decisions by tting mathematical models to observed data. This area has seen
explosive growth and is now (incorrectly) almost synonymous with the term AI.
A deep neural network is a type of machine learning model, and when it is fitted
to data, this is referred to as deep learning. At the time of writing, deep networks are
the most powerful and practical machine learning models and are often encountered
in day-to-day life. It is commonplace to translate text from another language using a
natural language processing algorithm, to search the internet for images of a particular
object using a computer vision system, or to converse with a digital assistant via a speech
recognition interface. All of these applications are powered by deep learning.
As the title suggests, this book aims to help a reader new to this field understand
the principles behind deep learning. The book is neither terribly theoretical (there are
no proofs) nor extremely practical (there is almost no code). The goal is to explain the
underlying ideas; after consuming this volume, the reader will be able to apply deep
learning to novel situations where there is no existing recipe for success.
Machine learning methods can coarsely be divided into three areas: supervised, unsu-
pervised, and reinforcement learning. At the time of writing, the cutting-edge methods
in all three areas rely on deep learning (figure 1.1). This introductory chapter describes
these three areas at a high level, and this taxonomy is also loosely reflected in the book's
organization. Whether we like it or not, deep learning is poised to change our world,
and this change will not all be positive. Hence, this chapter also contains a brief primer
on AI ethics. We conclude with advice on how to make the most of this book.
1.1 Supervised learning
Supervised learning models dene a mapping from input data to an output prediction.
In the following sections, we discuss the inputs, the outputs, the model itself, and what
is meant by “training” a model.
Figure 1.1 Machine learning is an area
of artificial intelligence that fits mathematical
models to observed data. It
can coarsely be divided into supervised
learning, unsupervised learning, and re-
inforcement learning. Deep neural net-
works contribute to each of these areas.
1.1.1 Regression and classication problems
Figure 1.2 depicts several regression and classification problems. In each case, there is a
meaningful real-world input (a sentence, a sound file, an image, etc.), and this is encoded
as a vector of numbers. This vector forms the model input. The model maps the input to
an output vector which is then “translated” back to a meaningful real-world prediction.
For now, we focus on the inputs and outputs and treat the model as a black box that
ingests a vector of numbers and returns another vector of numbers.
The model in gure 1.2a predicts the price of a house based on input characteristics
such as the square footage and the number of bedrooms. This is a regression problem
because the model returns a continuous number (rather than a category assignment).
In contrast, the model in figure 1.2b takes the chemical structure of a molecule as an
input and predicts both the freezing and boiling points. This is a multivariate regression
problem since it predicts more than one number.
The model in gure 1.2c receives a text string containing a restaurant review as input
and predicts whether the review is positive or negative. This is a binary classication
problem because the model attempts to assign the input to one of two categories. The
output vector contains the probabilities that the input belongs to each category. Fig-
ures 1.2d and 1.2e depict multiclass classication problems. Here, the model assigns the
input to one of N > 2 categories. In the rst case, the input is an audio le, and the
model predicts which genre of music it contains. In the second case, the input is an
image, and the model predicts which object it contains. In each case, the model returns
a vector of size N that contains the probabilities of the N categories.
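The claim that a classification model returns a vector of N probabilities can be made concrete with a small sketch. One standard way to produce such a vector (discussed later in the book) is the softmax function, which exponentiates and normalizes N raw model scores; the scores below are invented for illustration:

```python
import math

def to_probabilities(scores):
    """Map N raw model scores to N values that are non-negative
    and sum to one (the softmax function)."""
    m = max(scores)                        # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Invented raw scores for three music genres; the largest score
# receives the largest probability.
probs = to_probabilities([2.0, 1.0, 0.1])
print(probs)
print(sum(probs))  # sums to 1.0 (up to floating-point rounding)
```

Softmax itself is not introduced until later in the book; it appears here only to make the "vector of probabilities" idea tangible.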
1.1.2 Inputs
The input data in gure 1.2 varies widely. In the house pricing example, the input is a
xed-length vector containing values that characterize the property. This is an example
of tabular data because it has no internal structure; if we change the order of the inputs
and build a new model, then we expect the model prediction to remain the same.
Conversely, the input in the restaurant review example is a body of text. This may
be of variable length depending on the number of words in the review, and here input
Figure 1.2 Regression and classification problems. a) This regression model takes
a vector of numbers that characterize a property and predicts its price. b) This
multivariate regression model takes the structure of a chemical molecule and
predicts its freezing and boiling points. c) This binary classification model takes a
restaurant review and classifies it as either positive or negative. d) This multiclass
classification problem assigns a snippet of audio to one of N genres. e) A second
multiclass classification problem in which the model classifies an image according
to which of N possible objects it might contain.
Figure 1.3 Machine learning model. The model represents a family of relationships
that relate the input (age of child) to the output (height of child). The particular
relationship is chosen using training data, which consists of input/output pairs
(orange points). When we train the model, we search through the possible re-
lationships for one that describes the data well. Here, the trained model is the
cyan curve and can be used to compute the height for any age.
order is important; "my wife ate the chicken" is not the same as "the chicken ate my wife."
The text must be encoded into numerical form before passing it to the model. Here, we
use a fixed vocabulary of size 10,000 and simply concatenate the word indices.
For the music classication example, the input vector might be of xed size (perhaps
a 10-second clip) but is very high-dimensional. Digital audio is usually sampled at 44.1
kHz and represented by 16-bit integers, so a ten-second clip consists of 441, 000 integers.
Clearly, supervised learning models will have to be able to process sizeable inputs. The
input in the image classication example (which consists of the concatenated RGB values
at every pixel) is also enormous. Moreover, its structure is naturally two-dimensional;
two pixels above and below one another are closely related, even if they are not adjacent
in the input vector.
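The input sizes quoted above follow from simple arithmetic, sketched below; the 224×224 image resolution is an assumption chosen purely for illustration (the text does not specify one):

```python
# Audio: 10 seconds sampled at 44.1 kHz, one 16-bit integer per sample.
sample_rate_hz = 44_100
clip_seconds = 10
audio_input_size = sample_rate_hz * clip_seconds
print(audio_input_size)  # 441000 integers

# Image: concatenated RGB values at every pixel; a 224x224 resolution
# is assumed here only to give a sense of scale.
height, width, channels = 224, 224, 3
image_input_size = height * width * channels
print(image_input_size)  # 150528 numbers
```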
Finally, consider the input for the model that predicts the freezing and boiling points
of the molecule. A molecule may contain varying numbers of atoms that can be connected
in different ways. In this case, the model must ingest both the geometric structure of
the molecule and the identities of its constituent atoms.
1.1.3 Machine learning models
Until now, we have treated the machine learning model as a black box that takes an input
vector and returns an output vector. But what exactly is in this black box? Consider a
model to predict the height of a child from their age (figure 1.3). The machine learning
model is a mathematical equation that describes how the average height varies as a
function of age (cyan curve in figure 1.3). When we run the age through this equation,
it returns the height. For example, if the age is 10 years, then we predict that the height
will be 139 cm.
More precisely, the model represents a family of equations mapping the input to
the output (i.e., a family of different cyan curves). The particular equation (curve) is
chosen using training data (examples of input/output pairs). In figure 1.3, these pairs
are represented by the orange points, and we can see that the model (cyan line) describes
these data reasonably. When we talk about training or fitting a model, we mean that we
search through the family of possible equations (possible cyan curves) relating input to
output to find the one that describes the training data most accurately.
It follows that the models in figure 1.2 require labeled input/output pairs for training.
For example, the music classification model would require a large number of audio clips
where a human expert had identied the genre of each. These input/output pairs take
the role of a teacher or supervisor for the training process, and this gives rise to the term
supervised learning.
1.1.4 Deep neural networks
This book concerns deep neural networks, which are a particularly useful type of machine
learning model. They are equations that can represent an extremely broad family of
relationships between input and output, and where it is particularly easy to search
through this family to nd the relationship that describes the training data.
Deep neural networks can process inputs that are very large, of variable length,
and contain various kinds of internal structures. They can output single real numbers
(regression), multiple numbers (multivariate regression), or probabilities over two or more
classes (binary and multiclass classification, respectively). As we shall see in the next
section, their outputs may also be very large, of variable length, and contain internal
structure. It is probably hard to imagine equations with these properties, and the reader
should endeavor to suspend disbelief for now.
1.1.5 Structured outputs
Figure 1.4a depicts a multivariate binary classification model for semantic segmentation.
Here, every pixel of an input image is assigned a binary label that indicates whether it
belongs to a cow or the background. Figure 1.4b shows a multivariate regression model
where the input is an image of a street scene and the output is the depth at each pixel.
In both cases, the output is high-dimensional and structured. However, this structure is
closely tied to the input, and this can be exploited; if a pixel is labeled as “cow,” then a
neighbor with a similar RGB value probably has the same label.
Figures 1.4c–e depict three models where the output has a complex structure that is
not so closely tied to the input. Figure 1.4c shows a model where the input is an audio
file and the output is the transcribed words from that file. Figure 1.4d is a translation
Draft: please send errata to udlbookmail@gmail.com.
Figure 1.4 Supervised learning tasks with structured outputs. a) This semantic
segmentation model maps an RGB image to a binary image indicating whether
each pixel belongs to the background or a cow (adapted from Noh et al., 2015).
b) This monocular depth estimation model maps an RGB image to an output
image where each pixel represents the depth (adapted from Cordts et al., 2016).
c) This audio transcription model maps an audio sample to a transcription of
the spoken words in the audio. d) This translation model maps an English text
string to its French translation. e) This image synthesis model maps a caption to
an image (example from https://openai.com/dall-e-2/). In each case, the output
has a complex internal structure or grammar. In some cases, many outputs are
compatible with the input.
model in which the input is a body of text in English, and the output contains the French
translation. Figure 1.4e depicts a very challenging task in which the input is descriptive
text, and the model must produce an image that matches this description.
In principle, the latter three tasks can be tackled in the standard supervised learning
framework, but they are more difficult for two reasons. First, the output may genuinely
be ambiguous; there are multiple valid translations from an English sentence to a French
one and multiple images that are compatible with any caption. Second, the output
contains considerable structure; not all strings of words make valid English and French
sentences, and not all collections of RGB values make plausible images. In addition to
learning the mapping, we also have to respect the “grammar” of the output.
Fortunately, this “grammar” can be learned without the need for output labels. For
example, we can learn how to form valid English sentences by learning the statistics of a
large corpus of text data. This provides a connection with the next section of the book,
which considers unsupervised learning models.
1.2 Unsupervised learning
Constructing a model from input data without corresponding output labels is termed
unsupervised learning; the absence of output labels means there can be no “supervision.”
Rather than learning a mapping from input to output, the goal is to describe or under-
stand the structure of the data. As was the case for supervised learning, the data may
have very dierent characteristics; it may be discrete or continuous, low-dimensional or
high-dimensional, and of constant or variable length.
1.2.1 Generative models
This book focuses on generative unsupervised models, which learn to synthesize new
data examples that are statistically indistinguishable from the training data. Some
generative models explicitly describe the probability distribution over the input data and
here new examples are generated by sampling from this distribution. Others merely learn
a mechanism to generate new examples without explicitly describing their distribution.
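The distinction between explicit and implicit generative models can be sketched in one dimension: an explicit model writes down a density that it can both evaluate and sample from, while an implicit model only provides a sampling mechanism. Both toy models below are illustrative, not real generative models:

```python
import math
import random

rng = random.Random(0)

# Explicit: a known density (standard normal) that we can evaluate
# pointwise and sample from directly.
def density(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

explicit_sample = rng.gauss(0.0, 1.0)

# Implicit: a mechanism that transforms simple noise into samples,
# without an explicit formula for the density of the results.
def implicit_sample():
    z = rng.uniform(-1.0, 1.0)
    return z ** 3 + 0.5 * z

s = implicit_sample()
```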
State-of-the-art generative models can synthesize examples that are extremely plau-
sible but distinct from the training examples. They have been particularly successful
at generating images (figure 1.5) and text (figure 1.6). They can also synthesize data
under the constraint that some outputs are predetermined (termed conditional genera-
tion). Examples include image inpainting (figure 1.7) and text completion (figure 1.8).
Indeed, modern generative models for text are so powerful that they can appear intel-
ligent. Given a body of text followed by a question, the model can often “fill in” the
missing answer by generating the most likely completion of the document. However, in
reality, the model only knows about the statistics of language and does not understand
the signicance of its answers.
Figure 1.5 Generative models for images. Left: two images were generated from
a model trained on pictures of cats. These are not real cats, but samples from a
probability model. Right: two images generated from a model trained on images
of buildings. Adapted from Karras et al. (2020b).
The moon had risen by the time I reached the edge of the forest, and the light that filtered through the
trees was silver and cold. I shivered, though I was not cold, and quickened my pace. I had never been
so far from the village before, and I was not sure what to expect. I had been walking for hours, and I
was tired and hungry. I had left in such a hurry that I had not thought to pack any food, and I had
not thought to bring a weapon. I was unarmed and alone in a strange place, and I did not know what
I was doing.
I had been walking for so long that I had lost all sense of time, and I had no idea how far I had come.
I only knew that I had to keep going. I had to find her. I was getting close. I could feel it. She was
nearby, and she was in trouble. I had to find her and help her, before it was too late.
Figure 1.6 Short story synthesized from a generative model of text data. The
model describes a probability distribution that assigns a probability to every
output string. Sampling from the model creates strings that follow the statistics
of the training data (here, short stories) but have never been seen before.
Figure 1.7 Inpainting. In the original image (left), the boy is obscured by metal
cables. These undesirable regions (center) are removed and the generative model
synthesizes a new image (right) under the constraint that the remaining pixels
must stay the same. Adapted from Saharia et al. (2022a).
I was a little nervous before my rst lecture at the University of Bath. It seemed like there were
hundreds of students and they looked intimidating. I stepped up to the lectern and was about to speak
when something bizarre happened.
Suddenly, the room was lled with a deafening noise, like a giant roar. It was so loud that I
couldn’t hear anything else and I had to cover my ears. I could see the students looking around, con-
fused and frightened. Then, as quickly as it had started, the noise stopped and the room was silent again.
I stood there for a few moments, trying to make sense of what had just happened. Then I realized that
the students were all staring at me, waiting for me to say something. I tried to think of something witty
or clever to say, but my mind was blank. So I just said, “Well, that was strange,” and then I started my
lecture.
Figure 1.8 Conditional text synthesis. Given an initial body of text (in black),
generative models of text can continue the string plausibly by synthesizing the
“missing” remaining part of the string. Generated by GPT3 (Brown et al., 2020).
Figure 1.9 Variation of the human face. The human face contains roughly 42
muscles, so it’s possible to describe most of the variation in images of the same
person in the same lighting with just 42 numbers. In general, datasets of images,
music, and text can be described by a relatively small number of underlying
variables, although it is typically more difficult to tie these to particular physical
mechanisms. Images from Dynamic FACES database (Holland et al., 2019).
1.2.2 Latent variables
Some (but not all) generative models exploit the observation that data can be lower
dimensional than the raw number of observed variables suggests. For example, the num-
ber of valid and meaningful English sentences is considerably smaller than the number
of strings created by drawing words at random. Similarly, real-world images are a tiny
subset of the images that can be created by drawing random RGB values for every pixel.
This is because images are generated by physical processes (see figure 1.9).
This leads to the idea that we can describe each data example using a smaller number
of underlying latent variables. Here, the role of deep learning is to describe the mapping
between these latent variables and the data. The latent variables typically have a simple
Figure 1.10 Latent variables. Many generative models use a deep learning model
to describe the relationship between a low-dimensional “latent” variable and the
observed high-dimensional data. The latent variables have a simple probability
distribution by design. Hence, new examples can be generated by sampling from
the simple distribution over the latent variables and then using the deep learning
model to map the sample to the observed data space.
Figure 1.11 Image interpolation. In each row the left and right images are real
and the three images in between represent a sequence of interpolations created
by a generative model. The generative models that underpin these interpolations
have learned that all images can be created by a set of underlying latent variables.
By nding these variables for the two real images, interpolating their values, and
then using these intermediate variables to create new images, we can generate
intermediate results that are both visually plausible and mix the characteristics
of the two original images. Top row adapted from Sauer et al. (2022). Bottom
row adapted from Ramesh et al. (2022).
Figure 1.12 Multiple images generated from the caption “A teddy bear on a
skateboard in Times Square.” Generated by DALL·E-2 (Ramesh et al., 2022).
probability distribution by design. By sampling from this distribution and passing the
result through the deep learning model, we can create new samples (figure 1.10).
These models lead to new methods for manipulating real data. For example, consider
finding the latent variables that underpin two real examples. We can interpolate between
these examples by interpolating between their latent representations and mapping the
intermediate positions back into the data space (figure 1.11).
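The interpolation procedure can be sketched abstractly: find latent vectors for two examples, linearly interpolate, and decode each intermediate point. The decoder below is a stand-in placeholder, not a real generative model:

```python
import numpy as np

def interpolate_latents(z1, z2, n_steps=5):
    """Linearly interpolate between two latent vectors (endpoints included)."""
    weights = np.linspace(0.0, 1.0, n_steps)
    return [(1 - w) * z1 + w * z2 for w in weights]

# Stand-in decoder: a real model would map latents to images or text.
def decode(z):
    return np.tanh(z)  # placeholder mapping into "data space"

z_a = np.array([0.0, 1.0])   # latent vector for the first real example
z_b = np.array([2.0, -1.0])  # latent vector for the second real example

# Decoding the interpolated latents yields intermediate results that
# mix characteristics of the two originals.
path = [decode(z) for z in interpolate_latents(z_a, z_b)]
```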
1.2.3 Connecting supervised and unsupervised learning
Generative models with latent variables can also benefit supervised learning models
where the outputs have structure (figure 1.4). For example, consider learning to predict
the images corresponding to a caption. Rather than directly map the text input to an
image, we can learn a relation between latent variables that explain the text and the
latent variables that explain the image.
This has three advantages. First, we may need fewer text/image pairs to learn this
mapping now that the inputs and outputs are lower dimensional. Second, we are more
likely to generate a plausible-looking image; any sensible values of the latent variables
should produce something that looks like a plausible example. Third, if we introduce
randomness to either the mapping between the two sets of latent variables or the mapping
from the latent variables to the image, then we can generate multiple images that are all
described well by the caption (figure 1.12).
1.3 Reinforcement learning
The nal area of machine learning is reinforcement learning. This paradigm introduces
the idea of an agent which lives in a world and can perform certain actions at each time
step. The actions change the state of the system but not necessarily in a deterministic
way. Taking an action can also produce rewards, and the goal of reinforcement learning
is for the agent to learn to choose actions that lead to high rewards on average.
One complication is that the reward may occur some time after the action is taken,
so associating a reward with an action is not straightforward. This is known as the
temporal credit assignment problem. As the agent learns, it must trade off exploration
and exploitation of what it already knows; perhaps the agent has already learned how to
receive modest rewards; should it follow this strategy (exploit what it knows), or should
it try dierent actions to see if it can improve (explore other opportunities)?
1.3.1 Two examples
Consider teaching a humanoid robot to locomote. The robot can perform a limited
number of actions at a given time (moving various joints), and these change the state of
the world (its pose). We might reward the robot for reaching checkpoints in an obstacle
course. To reach each checkpoint, it must perform many actions, and it’s unclear which
ones contributed to the reward when it is received and which were irrelevant. This is an
example of the temporal credit assignment problem.
A second example is learning to play chess. Again, the agent has a set of valid actions
(chess moves) at any given time. However, these actions change the state of the system
in a non-deterministic way; for any choice of action, the opposing player might respond
with many dierent moves. Here, we might set up a reward structure based on capturing
pieces or just have a single reward at the end of the game for winning. In the latter case,
the temporal credit assignment problem is extreme; the system must learn which of the
many moves it made were instrumental to success or failure.
The exploration-exploitation trade-off is also apparent in these two examples. The
robot may have discovered that it can make progress by lying on its side and pushing
with one leg. This strategy will move the robot and yields rewards, but much more slowly
than the optimal solution: to balance on its legs and walk. So, it faces a choice between
exploiting what it already knows (how to slide along the floor awkwardly) and exploring
the space of actions (which might result in much faster locomotion). Similarly, in the
chess example, the agent may learn a reasonable sequence of opening moves. Should it
exploit this knowledge or explore different opening sequences?
It is perhaps not obvious how deep learning fits into the reinforcement learning frame-
work. There are several possible approaches, but one technique is to use deep networks
to build a mapping from the observed world state to an action. This is known as a
policy network. In the robot example, the policy network would learn a mapping from
its sensor measurements to joint movements. In the chess example, the network would
learn a mapping from the current state of the board to the choice of move (figure 1.13).
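A policy network's role (state in, action probabilities out) can be sketched with a tiny fully connected network in NumPy. The layer sizes and random weights here are purely illustrative; a trained network would have learned weights:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a state vector of 8 numbers, 4 possible actions.
state_dim, hidden_dim, n_actions = 8, 16, 4
W1 = rng.normal(size=(hidden_dim, state_dim))
W2 = rng.normal(size=(n_actions, hidden_dim))

def policy(state):
    """Map an observed state to a probability distribution over actions."""
    hidden = np.maximum(0.0, W1 @ state)   # ReLU hidden layer
    logits = W2 @ hidden
    exp = np.exp(logits - logits.max())    # numerically stable softmax
    return exp / exp.sum()

probs = policy(rng.normal(size=state_dim))
action = int(np.argmax(probs))             # or sample an action from probs
```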
1.4 Ethics
It would be irresponsible to write this book without discussing the ethical implications
of articial intelligence. This potent technology will change the world to at least the
Figure 1.13 Policy networks for reinforcement learning. One way to incorporate
deep neural networks into reinforcement learning is to use them to define a map-
ping from the state (here position on chessboard) to the actions (possible moves).
This mapping is known as a policy.
same extent as electricity, the internal combustion engine, the transistor, or the internet.
The potential benets in healthcare, design, entertainment, transport, education, and
almost every area of commerce are enormous. However, scientists and engineers are often
unrealistically optimistic about the outcomes of their work, and the potential for harm
is just as great. The following paragraphs highlight ve concerns.
Bias and fairness: If we train a system to predict salary levels for individuals based
on historical data, then this system will reproduce historical biases; for example, it will
probably predict that women should be paid less than men. Several such cases have
already become international news stories: an AI system for super-resolving face images
made non-white people look more white; a system for generating images produced only
pictures of men when asked to synthesize pictures of lawyers. Careless application of
algorithmic decision-making using AI has the potential to entrench or aggravate existing
biases. See Binns (2018) for further discussion.
Explainability: Deep learning systems make decisions, but we do not usually know
exactly how or based on what information. They may contain billions of parameters,
and there is no way we can understand how they work based on examination. This has
led to the sub-eld of explainable AI. One moderately successful area is producing local
explanations; we cannot explain the entire system, but we can produce an interpretable
description of why a particular decision was made. However, it remains unknown whether
it is possible to build complex decision-making systems that are fully transparent to their
users or even their creators. See Grennan et al. (2022) for further information.
Weaponizing AI: All signicant technologies have been applied directly or indirectly
toward war. Sadly, violent conict seems to be an inevitable feature of human behavior.
AI is arguably the most powerful technology ever built and will doubtless be deployed
extensively in a military context. Indeed, this is already happening (Heikkilä, 2022).
Concentrating power: It is not from a benevolent interest in improving the lot of the
human race that the world’s most powerful companies are investing heavily in artificial
intelligence. They know that these technologies will allow them to reap enormous
profits. Like any advanced technology, deep learning is likely to concentrate power in
the hands of the few organizations that control it. Automating jobs that are currently
done by humans will change the economic environment and disproportionately affect the
livelihoods of lower-paid workers with fewer skills. Optimists argue similar disruptions
happened during the industrial revolution and resulted in shorter working hours. The
truth is that we simply do not know what effects the large-scale adoption of AI will have
on society (see David, 2015).
Existential risk: The major existential risks to the human race all result from tech-
nology. Climate change has been driven by industrialization. Nuclear weapons derive
from the study of physics. Pandemics are more probable and spread faster because in-
novations in transport, agriculture, and construction have allowed a larger, denser, and
more interconnected population. Articial intelligence brings new existential risks. We
should be very cautious about building systems that are more capable and extensible
than human beings. In the most optimistic case, it will put vast power in the hands
of the owners. In the most pessimistic case, we will be unable to control it or even
understand its motives (see Tegmark, 2018).
This list is far from exhaustive. AI could also enable surveillance, disinformation,
violations of privacy, fraud, and manipulation of financial markets, and the energy re-
quired to train AI systems contributes to climate change. Moreover, these concerns are
not speculative; there are already many examples of ethically dubious applications of
AI (consult Dao, 2021, for a partial list). In addition, the recent history of the inter-
net has shown how new technology can cause harm in unexpected ways. The online
community of the eighties and early nineties could hardly have predicted the prolifera-
tion of fake news, spam, online harassment, fraud, cyberbullying, incel culture, political
manipulation, doxxing, online radicalization, and revenge porn.
Everyone studying or researching (or writing books about) AI should contemplate
to what degree scientists are accountable for the uses of their technology. We should
consider that capitalism primarily drives the development of AI and that legal advances
and deployment for social good are likely to lag significantly behind. We should reflect
on whether it’s possible, as scientists and engineers, to control progress in this field and
to reduce the potential for harm. We should consider what kind of organizations we
are prepared to work for. How serious are they in their commitment to reducing the
potential harms of AI? Are they simply “ethics-washing” to reduce reputational risk, or
do they actually implement mechanisms to halt ethically suspect projects?
All readers are encouraged to investigate these issues further. The online course
at https://ethics-of-ai.mooc./ is a useful introductory resource. If you are a professor
teaching from this book, you are encouraged to raise these issues with your students. If
you are a student taking a course where this is not done, then lobby your professor to
make this happen. If you are deploying or researching AI in a corporate environment,
you are encouraged to scrutinize your employer’s values and to help change them (or
leave) if they are wanting.
1.5 Structure of book
The structure of the book follows the structure of this introduction. Chapters 2–9 walk
through the supervised learning pipeline. We describe shallow and deep neural networks
and discuss how to train them and measure and improve their performance. Chap-
ters 10–13 describe common architectural variations of deep neural networks, including
convolutional networks, residual connections, and transformers. These architectures are
used across supervised, unsupervised, and reinforcement learning.
Chapters 14–18 tackle unsupervised learning using deep neural networks. We devote
a chapter each to four modern deep generative models: generative adversarial networks,
variational autoencoders, normalizing flows, and diffusion models. Chapter 19 is a brief
introduction to deep reinforcement learning. This is a topic that easily justies its own
book, so the treatment is necessarily superficial. However, this treatment is intended to
be a good starting point for readers unfamiliar with this area.
Despite the title of this book, some aspects of deep learning remain poorly under-
stood. Chapter 20 poses some fundamental questions. Why are deep networks so easy
to train? Why do they generalize so well? Why do they need so many parameters? Do
they need to be deep? Along the way, we explore unexpected phenomena such as the
structure of the loss function, double descent, grokking, and lottery tickets. The book
concludes with chapter 21, which discusses ethics and deep learning.
1.6 Other books
This book is self-contained but is limited to coverage of deep learning. It is intended to
be the spiritual successor to Deep Learning (Goodfellow et al., 2016) which is a fantastic
resource but does not cover recent advances. For a broader look at machine learning, the
most up-to-date and encyclopedic resource is Probabilistic Machine Learning (Murphy,
2022, 2023). However, Pattern Recognition and Machine Learning (Bishop, 2006) is still
an excellent and relevant book.
If you enjoy this book, then my previous volume, Computer Vision: Models, Learning,
and Inference (Prince, 2012), is still worth reading. Some parts have dated badly, but it
contains a thorough introduction to probability, including Bayesian methods, and good
introductory coverage of latent variable models, geometry for computer vision, Gaussian
processes, and graphical models. It uses identical notation to this book and can be found
online. A detailed treatment of graphical models can be found in Probabilistic Graphical
Models: Principles and Techniques (Koller & Friedman, 2009), and Gaussian processes
are covered by Gaussian Processes for Machine Learning (Williams & Rasmussen, 2006).
For background mathematics, consult Mathematics for Machine Learning (Deisen-
roth et al., 2020). For a more coding-oriented approach, consult Dive into Deep Learning
(Zhang et al., 2023). The best overview for computer vision is Szeliski (2022), and there
is also the forthcoming book Foundations of Computer Vision (Torralba et al., 2024).
A good starting point to learn about graph neural networks is Graph Representation
Learning (Hamilton, 2020). The definitive work on reinforcement learning is Reinforcement
Learning: An Introduction (Sutton & Barto, 2018). A good initial resource is
Foundations of Deep Reinforcement Learning (Graesser & Keng, 2019).
1.7 How to read this book
Most remaining chapters in this book contain a main body of text, a notes section, and
a set of problems. The main body of the text is intended to be self-contained and can be
read without recourse to the other parts of the chapter. As much as possible, background
mathematics is incorporated into the main body of the text. However, for larger topics
that would be a distraction to the main thread of the argument, the background material
is appendicized, and a reference is provided in the margin. Most notation in this book is
standard. However, some conventions are less widely used, and the reader is encouraged
to consult appendix A before proceeding.
The main body of text includes many novel illustrations and visualizations of deep
learning models and results. I’ve worked hard to provide new explanations of existing
ideas rather than merely curate the work of others. Deep learning is a new field, and
sometimes phenomena are poorly understood. I try to make it clear where this is the
case and when my explanations should be treated with caution.
References are included in the main body of the chapter only where results are de-
picted. Instead, they can be found in the notes section at the end of the chapter. I do
not generally respect historical precedent in the main text; if an ancestor of a current
technique is no longer useful, then I will not mention it. However, the historical develop-
ment of the eld is described in the notes section, and hopefully, credit is fairly assigned.
The notes are organized into paragraphs and provide pointers for further reading. They
should help the reader orient themselves within the sub-area and understand how it re-
lates to other parts of machine learning. The notes are less self-contained than the main
text. Depending on your level of background knowledge and interest, you may nd these
sections more or less useful.
Each chapter has a number of associated problems. They are referenced in the margin
of the main text at the point that they should be attempted. As George Pólya noted,
“Mathematics, you see, is not a spectator sport.” He was correct, and I highly recommend
that you attempt the problems as you go. In some cases, they provide insights that will
help you understand the main text. Problems for which the answers are provided on the
associated website are indicated with an asterisk. Additionally, Python notebooks that
will help you understand the ideas in this book are also available via the website, and
these are also referenced in the margins of the text. Indeed, if you are feeling rusty, it
might be worth working through the notebook on background mathematics (notebook 1.1)
right now.
Unfortunately, the pace of research in AI makes it inevitable that this book will be a
constant work in progress. If there are parts you nd hard to understand, notable omis-
sions, or sections that seem extraneous, please get in touch via the associated website.
Together, we can make the next edition better.
Chapter 2
Supervised learning
A supervised learning model denes a mapping from one or more inputs to one or more
outputs. For example, the input might be the age and mileage of a secondhand Toyota
Prius, and the output might be the estimated value of the car in dollars.
The model is just a mathematical equation; when the inputs are passed through this
equation, it computes the output, and this is termed inference. The model equation also
contains parameters. Dierent parameter values change the outcome of the computa-
tion; the model equation describes a family of possible relationships between inputs and
outputs, and the parameters specify the particular relationship.
When we train or learn a model, we nd parameters that describe the true relationship
between inputs and outputs. A learning algorithm takes a training set of input/output
pairs and manipulates the parameters until the inputs predict their corresponding out-
puts as closely as possible. If the model works well for these training pairs, then we hope
it will make good predictions for new inputs where the true output is unknown.
The goal of this chapter is to expand on these ideas. First, we describe this framework
more formally and introduce some notation. Then we work through a simple example
in which we use a straight line to describe the relationship between input and output.
This linear model is both familiar and easy to visualize, but nevertheless illustrates all
the main ideas of supervised learning.
2.1 Supervised learning overview
In supervised learning, we aim to build a model that takes an input x and outputs a
prediction y. For simplicity, we assume that both the input x and output y are vectors
of a predetermined and xed size and that the elements of each vector are always ordered
in the same way; in the Prius example above, the input x would always contain the age
of the car and then the mileage, in that order. This is termed structured or tabular data.
To make the prediction, we need a model f[] that takes input x and returns y, so:
y = f[x]. (2.1)
Draft: please send errata to udlbookmail@gmail.com.
When we compute the prediction y from the input x, we call this inference.
The model is just a mathematical equation with a fixed form. It represents a family
of different relations between the input and the output. The model also contains param-
eters ϕ. The choice of parameters determines the particular relation between input and
output, so we should really write:
y = f[x, ϕ]. (2.2)
When we talk about learning or training a model, we mean that we attempt to find
parameters ϕ that make sensible output predictions from the input. We learn these
parameters using a training dataset of I pairs of input and output examples {x_i, y_i}. We
aim to select parameters that map each training input to its associated output as closely
as possible. We quantify the degree of mismatch in this mapping with the loss L. This
is a scalar value that summarizes how poorly the model predicts the training outputs
from their corresponding inputs for parameters ϕ.
We can treat the loss as a function L[ϕ] of these parameters. When we train the
model, we are seeking parameters ϕ̂ that minimize this loss function:¹

ϕ̂ = argmin_ϕ [ L[ϕ] ].  (2.3)

If the loss is small after this minimization, we have found model parameters that accu-
rately predict the training outputs y_i from the training inputs x_i.
After training a model, we must now assess its performance; we run the model on
separate test data to see how well it generalizes to examples that it didn't observe during
training. If the performance is adequate, then we are ready to deploy the model.
2.2 Linear regression example
Let's now make these ideas concrete with a simple example. We consider a model y =
f[x, ϕ] that predicts a single output y from a single input x. Then we develop a loss
function, and finally, we discuss model training.
2.2.1 1D linear regression model
A 1D linear regression model describes the relationship between input x and output y
as a straight line:

y = f[x, ϕ]
  = ϕ0 + ϕ1 x.  (2.4)
¹ More properly, the loss function also depends on the training data {x_i, y_i}, so we should
write L[{x_i, y_i}, ϕ], but this is rather cumbersome.
Figure 2.1 Linear regression model. For a given choice of parameters ϕ = [ϕ0, ϕ1]^T,
the model makes a prediction for the output (y-axis) based on the input (x-axis).
Different choices for the y-intercept ϕ0 and the slope ϕ1 change these predictions
(cyan, orange, and gray lines). The linear regression model (equation 2.4) defines
a family of input/output relations (lines) and the parameters determine the
member of the family (the particular line).
This model has two parameters ϕ = [ϕ0, ϕ1]^T, where ϕ0 is the y-intercept of the line
and ϕ1 is the slope. Different choices for the y-intercept and slope result in different
relations between input and output (figure 2.1). Hence, equation 2.4 defines a fam-
ily of possible input-output relations (all possible lines), and the choice of parameters
determines the member of this family (the particular line).
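The model of equation 2.4 is simple enough to sketch in a few lines of NumPy. The parameter values below are arbitrary illustrations, not the lines from figure 2.1:

```python
import numpy as np

def f(x, phi):
    """1D linear regression model (equation 2.4): y = phi0 + phi1 * x."""
    phi0, phi1 = phi
    return phi0 + phi1 * x

phi = np.array([0.5, -1.0])    # one choice of y-intercept and slope
x = np.array([0.0, 1.0, 2.0])  # three inputs
y = f(x, phi)                  # array([ 0.5, -0.5, -1.5])
```

Changing `phi` selects a different member of the family of lines; the function `f` itself stays fixed.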
2.2.2 Loss
For this model, the training dataset (figure 2.2a) consists of I input/output pairs {x_i, y_i}.
Figures 2.2b–d show three lines defined by three sets of parameters. The green line
in figure 2.2d describes the data more accurately than the other two since it is much
closer to the data points. However, we need a principled approach for deciding which
parameters ϕ are better than others. To this end, we assign a numerical value to each
choice of parameters that quantifies the degree of mismatch between the model and the
data. We term this value the loss; a lower loss means a better fit.
The mismatch is captured by the deviation between the model predictions f[x_i, ϕ]
(height of the line at x_i) and the ground truth outputs y_i. These deviations are depicted
as orange dashed lines in figures 2.2b–d. We quantify the total mismatch, training error,
or loss as the sum of the squares of these deviations for all I training pairs:
L[ϕ] = Σ_{i=1}^{I} (f[x_i, ϕ] − y_i)²
     = Σ_{i=1}^{I} (ϕ0 + ϕ1 x_i − y_i)².  (2.5)
Since the best parameters minimize this expression, we call this a least-squares loss. The
squaring operation means that the direction of the deviation (i.e., whether the line is
Figure 2.2 Linear regression training data, model, and loss. a) The training data
(orange points) consist of I = 12 input/output pairs {x_i, y_i}. b–d) Each panel
shows the linear regression model with different parameters. Depending on the
choice of y-intercept and slope parameters ϕ = [ϕ0, ϕ1]^T, the model errors (orange
dashed lines) may be larger or smaller. The loss L is the sum of the squares of
these errors. The parameters that define the lines in panels (b) and (c) have large
losses L = 7.07 and L = 10.28, respectively, because the models fit badly. The
loss L = 0.20 in panel (d) is smaller because the model fits well; in fact, this has
the smallest loss of all possible lines, so these are the optimal parameters.
Figure 2.3 Loss function for linear regression model with the dataset in figure 2.2a.
a) Each combination of parameters ϕ = [ϕ0, ϕ1]^T has an associated loss. The re-
sulting loss function L[ϕ] can be visualized as a surface. The three circles repre-
sent the lines from figure 2.2b–d. b) The loss can also be visualized as a heatmap,
where brighter regions represent larger losses; here we are looking straight down
at the surface in (a) from above and gray ellipses represent isocontours. The best
fitting line (figure 2.2d) has the parameters with the smallest loss (green circle).
above or below the data) is unimportant. There are also theoretical reasons for this
choice which we return to in chapter 5.
The loss L is a function of the parameters ϕ; it will be larger when the model fit is
poor (figure 2.2b,c) and smaller when it is good (figure 2.2d). Considered in this light,
we term L[ϕ] the loss function or cost function. [Notebook 2.1: Supervised learning]
The goal is to find the parameters ϕ̂ that minimize this quantity:
ϕ̂ = argmin_ϕ [ L[ϕ] ]
  = argmin_ϕ [ Σ_{i=1}^{I} (f[x_i, ϕ] − y_i)² ]
  = argmin_ϕ [ Σ_{i=1}^{I} (ϕ0 + ϕ1 x_i − y_i)² ].  (2.6)
There are only two parameters (the y-intercept ϕ0 and slope ϕ1), so we can calculate
the loss for every combination of values and visualize the loss function as a surface
(figure 2.3). The "best" parameters are at the minimum of this surface. [Problems 2.1–2.2]
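Equation 2.5 translates directly into code. A minimal sketch — the toy data points below are invented for illustration and are not the dataset of figure 2.2:

```python
import numpy as np

def loss(phi, x, y):
    """Least-squares loss (equation 2.5): sum of squared deviations."""
    pred = phi[0] + phi[1] * x     # model predictions f[x_i, phi]
    return np.sum((pred - y) ** 2)

# toy data lying exactly on the line y = x
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])
print(loss(np.array([0.0, 1.0]), x, y))  # 0.0 -- this line fits exactly
print(loss(np.array([0.0, 0.5]), x, y))  # 3.5 -- a worse fit has a larger loss
```

Evaluating `loss` over a grid of (ϕ0, ϕ1) values is exactly how a surface like figure 2.3 is produced.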
2.2.3 Training
The process of nding parameters that minimize the loss is termed model tting, training,
or learning. The basic method is to choose the initial parameters randomly and then
improve them by “walking down” the loss function until we reach the bottom (gure 2.4).
One way to do this is to measure the gradient of the surface at the current position and
take a step in the direction that is most steeply downhill. Then we repeat this process
until the gradient is at and we can improve no further.
2
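This procedure can be sketched as follows. The gradient expressions come from differentiating equation 2.5 (the subject of problem 2.1); the learning rate and step count are arbitrary illustrative choices, not values from the book:

```python
import numpy as np

def grad(phi, x, y):
    """Gradient of the least-squares loss with respect to phi0 and phi1."""
    r = phi[0] + phi[1] * x - y                # per-point deviations
    return np.array([2 * np.sum(r), 2 * np.sum(r * x)])

def fit(x, y, lr=0.01, steps=2000):
    """Gradient descent: repeatedly step in the steepest downhill direction."""
    phi = np.zeros(2)                          # (a random start also works)
    for _ in range(steps):
        phi -= lr * grad(phi, x, y)
    return phi

x = np.array([1.0, 2.0, 3.0])
y = np.array([3.0, 5.0, 7.0])                  # generated from y = 1 + 2x
phi_hat = fit(x, y)                            # approaches [1.0, 2.0]
```

Stopping when the gradient is (nearly) zero corresponds to reaching the flat bottom of the loss surface.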
2.2.4 Testing
Having trained the model, we want to know how it will perform in the real world. We
do this by computing the loss on a separate set of test data. The degree to which the
prediction accuracy generalizes to the test data depends in part on how representative
and complete the training data is. However, it also depends on how expressive the model
is. A simple model like a line might not be able to capture the true relationship between
input and output. This is known as underfitting. Conversely, a very expressive model
may describe statistical peculiarities of the training data that are atypical and lead to
unusual predictions. This is known as overfitting.
2.3 Summary
A supervised learning model is a function y = f[x, ϕ] that relates inputs x to outputs y.
The particular relationship is determined by parameters ϕ. To train the model, we
define a loss function L[ϕ] over a training dataset {x_i, y_i}. This quantifies the mismatch
between the model predictions f[x_i, ϕ] and observed outputs y_i as a function of the
parameters ϕ. Then we search for the parameters that minimize the loss. We evaluate
the model on a different set of test data to see how well it generalizes to new inputs.
Chapters 3–9 expand on these ideas. First, we tackle the model itself; 1D linear
regression has the obvious drawback that it can only describe the relationship between the
input and output as a straight line. Shallow neural networks (chapter 3) are only slightly
more complex than linear regression but describe a much larger family of input/output
relationships. Deep neural networks (chapter 4) are just as expressive but can describe
complex functions with fewer parameters and work better in practice.
Chapter 5 investigates loss functions for different tasks and reveals the theoretical
underpinnings of the least-squares loss. Chapters 6 and 7 discuss the training process.
Chapter 8 discusses how to measure model performance. Chapter 9 considers regular-
ization techniques, which aim to improve that performance.
² This iterative approach is not actually necessary for the linear regression model. Here, it's possible
to find closed-form expressions for the parameters. However, this gradient descent approach works for
more complex models where there is no closed-form solution and where there are too many parameters
to evaluate the loss for every combination of values.
Figure 2.4 Linear regression training. The goal is to find the y-intercept and slope
parameters that correspond to the smallest loss. a) Iterative training algorithms
initialize the parameters randomly and then improve them by "walking downhill"
until no further improvement can be made. Here, we start at position 0 and move
a certain distance downhill (perpendicular to the contours) to position 1. Then
we re-calculate the downhill direction and move to position 2. Eventually, we
reach the minimum of the function (position 4). b) Each position 0–4 from panel
(a) corresponds to a different y-intercept and slope and so represents a different
line. As the loss decreases, the lines fit the data more closely.
Notes
Loss functions vs. cost functions: In much of machine learning and in this book, the terms
loss function and cost function are used interchangeably. However, more properly, a loss function
is the individual term associated with a data point (i.e., each of the squared terms on the right-
hand side of equation 2.5), and the cost function is the overall quantity that is minimized (i.e.,
the entire right-hand side of equation 2.5). A cost function can contain additional terms that
are not associated with individual data points (see section 9.1). More generally, an objective
function is any function that is to be maximized or minimized.
Generative vs. discriminative models: The models y = f[x, ϕ] in this chapter are discrim-
inative models. These make an output prediction y from real-world measurements x.
Another approach is to build a generative model x = g[y, ϕ], in which the real-world
measurements x are computed as a function of the output y. [Problem 2.3]
The generative approach has the disadvantage that it doesn't directly predict y. To perform
inference, we must invert the generative equation as y = g⁻¹[x, ϕ], and this may be difficult.
However, generative models have the advantage that we can build in prior knowledge about how
the data were created. For example, if we wanted to predict the 3D position and orientation y
of a car in an image x, then we could build knowledge about car shape, 3D geometry, and light
transport into the function x = g[y, ϕ].
This seems like a good idea, but in fact, discriminative models dominate modern machine
learning; the advantage gained from exploiting prior knowledge in generative models is usually
trumped by learning very flexible discriminative models with large amounts of training data.
Problems
Problem 2.1 To walk "downhill" on the loss function (equation 2.5), we measure its gradient
with respect to the parameters ϕ0 and ϕ1. Calculate expressions for the slopes ∂L/∂ϕ0
and ∂L/∂ϕ1.
Problem 2.2 Show that we can find the minimum of the loss function in closed form by setting
the expressions for the derivatives from problem 2.1 to zero and solving for ϕ0 and ϕ1. Note that
this works for linear regression but not for more complex models; this is why we use iterative
model fitting methods like gradient descent (figure 2.4).
Problem 2.3 Consider reformulating linear regression as a generative model, so we have x =
g[y, ϕ] = ϕ0 + ϕ1 y. What is the new loss function? Find an expression for the inverse func-
tion y = g⁻¹[x, ϕ] that we would use to perform inference. Will this model make the same
predictions as the discriminative version for a given training dataset {x_i, y_i}? One way to es-
tablish this is to write code that fits a line to three data points using both methods and see if
the result is the same.
Chapter 3
Shallow neural networks
Chapter 2 introduced supervised learning using 1D linear regression. However, this model
can only describe the input/output relationship as a line. This chapter introduces shallow
neural networks. These describe piecewise linear functions and are expressive enough
to approximate arbitrarily complex relationships between multi-dimensional inputs and
outputs.
3.1 Neural network example
Shallow neural networks are functions y = f[x, ϕ] with parameters ϕ that map multivari-
ate inputs x to multivariate outputs y. We defer a full definition until section 3.4 and
introduce the main ideas using an example network f[x, ϕ] that maps a scalar input x to
a scalar output y and has ten parameters ϕ = {ϕ0, ϕ1, ϕ2, ϕ3, θ10, θ11, θ20, θ21, θ30, θ31}:
y = f[x, ϕ]
  = ϕ0 + ϕ1 a[θ10 + θ11 x] + ϕ2 a[θ20 + θ21 x] + ϕ3 a[θ30 + θ31 x].  (3.1)
We can break down this calculation into three parts: first, we compute three linear
functions of the input data (θ10 + θ11 x, θ20 + θ21 x, and θ30 + θ31 x). Second, we pass the
three results through an activation function a[•]. Finally, we weight the three resulting
activations with ϕ1, ϕ2, and ϕ3, sum them, and add an offset ϕ0.
To complete the description, we must define the activation function a[•]. There are
many possibilities, but the most common choice is the rectified linear unit or ReLU:

a[z] = ReLU[z] = { 0   z < 0
                 { z   z ≥ 0.  (3.2)
This returns the input when it is positive and zero otherwise (figure 3.1).
It is probably not obvious which family of input/output relations is represented by
equation 3.1. Nonetheless, the ideas from the previous chapter are all applicable. Equa-
tion 3.1 represents a family of functions where the particular member of the family
Figure 3.1 Rectied linear unit (ReLU).
This activation function returns zero if
the input is less than zero and returns
the input unchanged otherwise. In other
words, it clips negative values to zero.
Note that there are many other possi-
ble choices for the activation function
(see gure 3.13), but the ReLU is the
most commonly used and the easiest to
understand.
Figure 3.2 Family of functions defined by equation 3.1. a–c) Functions for three
different choices of the ten parameters ϕ. In each case, the input/output relation
is piecewise linear. However, the positions of the joints, the slopes of the linear
regions between them, and the overall height vary.
depends on the ten parameters in ϕ. If we know these parameters, we can perform
inference (predict y) by evaluating the equation for a given input x. Given a training
dataset {x_i, y_i}_{i=1}^{I}, we can define a least squares loss function L[ϕ] and use this to mea-
sure how effectively the model describes this dataset for any given parameter values ϕ.
To train the model, we search for the values ϕ̂ that minimize this loss.
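Equation 3.1 can be implemented directly. The parameter values below are arbitrary and do not correspond to any panel of figure 3.2:

```python
import numpy as np

def relu(z):
    """ReLU activation (equation 3.2): clips negative values to zero."""
    return np.maximum(0.0, z)

def shallow_1_3(x, phi, theta):
    """Network of equation 3.1: one input, three hidden units, one output.
    phi = [phi0, phi1, phi2, phi3]; theta[d] = [theta_d0, theta_d1]."""
    h = relu(theta[:, 0] + theta[:, 1] * x)  # three clipped linear functions
    return phi[0] + np.dot(phi[1:], h)       # weight, sum, and add offset

# illustrative parameter values
phi = np.array([-0.2, 1.0, -0.5, 0.8])
theta = np.array([[0.3, -1.0], [-1.0, 2.0], [0.5, 0.6]])
y = shallow_1_3(0.5, phi, theta)             # 0.44
```

Sweeping `x` over a range and plotting `y` traces out one continuous piecewise linear function from the family.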
3.1.1 Neural network intuition
In fact, equation 3.1 represents a family of continuous piecewise linear functions (fig-
ure 3.2) with up to four linear regions. We now break down equation 3.1 and show why
it describes this family. To make this easier to understand, we split the function into
two parts. First, we introduce the intermediate quantities:
h1 = a[θ10 + θ11 x]
h2 = a[θ20 + θ21 x]
h3 = a[θ30 + θ31 x],  (3.3)
where we refer to h1, h2, and h3 as hidden units. Second, we compute the output by
combining these hidden units with a linear function:¹

y = ϕ0 + ϕ1 h1 + ϕ2 h2 + ϕ3 h3.  (3.4)
Figure 3.3 shows the ow of computation that creates the function in gure 3.2a.
Each hidden unit contains a linear function θ
0
+ θ
1
x of the input, and that line is
clipped by the ReLU function a[] below zero. The positions where the three lines cross
zero become the three “joints” in the nal output. The three clipped lines are then
weighted by ϕ
1
, ϕ
2
, and ϕ
3
, respectively. Finally, the oset ϕ
0
is added, which controls
the overall height of the nal function.
[Problems 3.1–3.8]
Each linear region in figure 3.3j corresponds to a different activation pattern in the
hidden units. When a unit is clipped, we refer to it as inactive, and when it is not
clipped, we refer to it as active. For example, the shaded region receives contributions
from h1 and h3 (which are active) but not from h2 (which is inactive). The slope of
each linear region is determined by (i) the original slopes θ•1 of the active inputs for this
region and (ii) the weights ϕ• that were subsequently applied. For example, the slope in
the shaded region (see problem 3.3) is θ11 ϕ1 + θ31 ϕ3, where the first term is the slope in
panel (g) and the second term is the slope in panel (i).
Each hidden unit contributes one "joint" to the function, so with three hidden units,
there can be four linear regions. However, only three of the slopes of these regions are
independent; the fourth is either zero (if all the hidden units are inactive in this region)
or is a sum of slopes from the other regions. [Notebook 3.1: Shallow networks I] [Problem 3.9]
3.1.2 Depicting neural networks
We have been discussing a neural network with one input, one output, and three hidden
units. We visualize this network in figure 3.4a. The input is on the left, the hidden units
are in the middle, and the output is on the right. Each connection represents one of the
ten parameters. To simplify this representation, we do not typically draw the intercept
parameters, so this network is usually depicted as in figure 3.4b.
¹ For the purposes of this book, a linear function has the form z′ = ϕ0 + Σ_i ϕi zi. Any other type of
function is nonlinear. For instance, the ReLU function (equation 3.2) and the example neural network
that contains it (equation 3.1) are both nonlinear. See notes at end of chapter for further clarification.
Figure 3.3 Computation for function in figure 3.2a. a–c) The input x is passed
through three linear functions, each with a different y-intercept θ•0 and slope θ•1.
d–f) Each line is passed through the ReLU activation function, which clips neg-
ative values to zero. g–i) The three clipped lines are then weighted (scaled) by
ϕ1, ϕ2, and ϕ3, respectively. j) Finally, the clipped and weighted functions are
summed, and an offset ϕ0 that controls the height is added. Each of the four
linear regions corresponds to a different activation pattern in the hidden units.
In the shaded region, h2 is inactive (clipped), but h1 and h3 are both active.
Figure 3.4 Depicting neural networks. a) The input x is on the left, the hidden
units h1, h2, and h3 in the center, and the output y on the right. Computation
flows from left to right. The input is used to compute the hidden units, which are
combined to create the output. Each of the ten arrows represents a parameter
(intercepts in orange and slopes in black). Each parameter multiplies its source
and adds the result to its target. For example, we multiply the parameter ϕ1
by source h1 and add it to y. We introduce additional nodes containing ones
(orange circles) to incorporate the offsets into this scheme, so we multiply ϕ0 by
one (with no effect) and add it to y. ReLU functions are applied at the hidden
units. b) More typically, the intercepts, ReLU functions, and parameter names
are omitted; this simpler depiction represents the same network.
3.2 Universal approximation theorem
In the previous section, we introduced an example neural network with one input, one
output, ReLU activation functions, and three hidden units. Let's now generalize this
slightly and consider the case with D hidden units, where the dth hidden unit is:

h_d = a[θ_d0 + θ_d1 x],  (3.5)
and these are combined linearly to create the output:

y = ϕ0 + Σ_{d=1}^{D} ϕ_d h_d.  (3.6)
The number of hidden units in a shallow network is a measure of the network capacity.
With ReLU activation functions, the output of a network with D hidden units has at
most D joints and so is a piecewise linear function with at most D + 1 linear regions. As
we add more hidden units, the model can approximate more complex functions. [Problem 3.10]
Indeed, with enough capacity (hidden units), a shallow network can describe any
continuous 1D function defined on a compact subset of the real line to arbitrary precision.
To see this, consider that every time we add a hidden unit, we add another linear region to
the function. As these regions become more numerous, they represent smaller sections
of the function, which are increasingly well approximated by a line (figure 3.5). The
universal approximation theorem proves that for any continuous function, there exists a
shallow network that can approximate this function to any specified precision.
Figure 3.5 Approximation of a 1D function (dashed line) by a piecewise linear
model. a–c) As the number of regions increases, the model becomes closer and
closer to the continuous function. A neural network with a scalar input creates
one extra linear region per hidden unit. The universal approximation theorem
proves that, with enough hidden units, there exists a shallow neural network that
can describe any given continuous function defined on a compact subset of ℝ^{D_i}
to arbitrary precision.
3.3 Multivariate inputs and outputs
In the above example, the network has a single scalar input x and a single scalar output y.
However, the universal approximation theorem also holds for the more general case
where the network maps multivariate inputs x = [x1, x2, ..., x_{D_i}]^T to multivariate output
predictions y = [y1, y2, ..., y_{D_o}]^T. We first explore how to extend the model to predict
multivariate outputs. Then we consider multivariate inputs. Finally, in section 3.4, we
present a general definition of a shallow neural network.
3.3.1 Visualizing multivariate outputs
To extend the network to multivariate outputs y, we simply use a different linear function
of the hidden units for each output. So, a network with a scalar input x, four hidden
units h1, h2, h3, and h4, and a 2D multivariate output y = [y1, y2]^T would be defined as:
h1 = a[θ10 + θ11 x]
h2 = a[θ20 + θ21 x]
h3 = a[θ30 + θ31 x]
h4 = a[θ40 + θ41 x],  (3.7)
and
Figure 3.6 Network with one input, four hidden units, and two outputs. a)
Visualization of network structure. b) This network produces two piecewise linear
functions, y1[x] and y2[x]. The four "joints" of these functions (at vertical dotted
lines) are constrained to be in the same places since they share the same hidden
units, but the slopes and overall height may differ.
Figure 3.7 Visualization of neural network with 2D multivariate input x =
[x1, x2]^T and scalar output y.
y1 = ϕ10 + ϕ11 h1 + ϕ12 h2 + ϕ13 h3 + ϕ14 h4
y2 = ϕ20 + ϕ21 h1 + ϕ22 h2 + ϕ23 h3 + ϕ24 h4.  (3.8)
The two outputs are two dierent linear functions of the hidden units.
As we saw in gure 3.3, the “joints” in the piecewise functions depend on where the
initial linear functions θ
0
+ θ
1
x are clipped by the ReLU functions a[] at the hidden
units. Since both outputs y
1
and y
2
are dierent linear functions of the same four hidden
Problem 3.11
units, the four “joints” in each must be in the same places. However, the slopes of the
linear regions and the overall vertical oset can dier (gure 3.6).
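Equations 3.7–3.8 can be sketched with the hidden units computed once and read out twice; the parameter values below are invented for illustration:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def two_output_net(x, theta, phi):
    """Network of equations 3.7-3.8: scalar input, four hidden units, two outputs.
    theta has shape (4, 2); phi has shape (2, 5) with offsets in column 0."""
    h = relu(theta[:, 0] + theta[:, 1] * x)  # shared hidden units
    return phi[:, 0] + phi[:, 1:] @ h        # two different linear read-outs

# illustrative parameters; both outputs share joints where theta_d0 + theta_d1*x = 0
theta = np.array([[1.0, -1.0], [-0.5, 1.0], [0.2, 2.0], [-1.0, 0.5]])
phi = np.array([[0.0, 1.0, -1.0, 0.5, 0.3],
                [1.0, -0.2, 0.4, -0.6, 1.0]])
y1, y2 = two_output_net(0.0, theta, phi)
```

Because `h` is computed once and reused, any kink in `y1` necessarily occurs at the same input value as the corresponding kink in `y2`.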
3.3.2 Visualizing multivariate inputs
To cope with multivariate inputs x, we extend the linear relations between the input
and the hidden units. So a network with two inputs x = [x1, x2]^T and a scalar output y
(figure 3.7) might have three hidden units defined by:
Figure 3.8 Processing in network with two inputs x = [x1, x2]^T, three hidden
units h1, h2, h3, and one output y. a–c) The input to each hidden unit is a
linear function of the two inputs, which corresponds to an oriented plane. Bright-
ness indicates function output. For example, in panel (a), the brightness repre-
sents θ10 + θ11 x1 + θ12 x2. Thin lines are contours. d–f) Each plane is clipped by
the ReLU activation function (cyan lines are equivalent to "joints" in figures 3.3d–
f). g–i) The clipped planes are then weighted, and j) summed together with an
offset that determines the overall height of the surface. The result is a continuous
surface made up of convex piecewise linear polygonal regions.
h1 = a[θ10 + θ11 x1 + θ12 x2]
h2 = a[θ20 + θ21 x1 + θ22 x2]
h3 = a[θ30 + θ31 x1 + θ32 x2],  (3.9)
where there is now one slope parameter for each input. The hidden units are combined
to form the output in the usual way:
y = ϕ0 + ϕ1 h1 + ϕ2 h2 + ϕ3 h3.  (3.10)
Figure 3.8 illustrates the processing of this network. Each hidden unit receives a linear
combination of the two inputs, which forms an oriented plane in the 3D input/output
space. [Problems 3.12–3.13] [Notebook 3.2: Shallow networks II] The activation function
clips the negative values of these planes to zero. The clipped planes are then recombined
in a second linear function (equation 3.10) to create a continuous piecewise linear surface
consisting of convex polygonal regions (figure 3.8j). [Appendix B.1.2: Convex region]
Each region corresponds to a different activation pattern. For example, in the central
triangular region, the first and third hidden units are active, and the second is inactive.
When there are more than two inputs to the model, it becomes difficult to visualize.
However, the interpretation is similar. The output will be a continuous piecewise linear
function of the input, where the linear regions are now convex polytopes in the multi-
dimensional input space.
Note that as the input dimensions grow, the number of linear regions increases rapidly
(figure 3.9). To get a feeling for how rapidly, consider that each hidden unit defines a
hyperplane that delineates the part of space where this unit is active from the part
where it is not (cyan lines in figures 3.8d–f). [Notebook 3.3: Shallow network regions]
If we had the same number of hidden units as input dimensions D_i, we could align each
hyperplane with one of the coordinate axes (figure 3.10). For two input dimensions, this
would divide the space into four quadrants. For three dimensions, this would create
eight octants, and for D_i dimensions, this would create 2^{D_i} orthants. Shallow neural
networks usually have more hidden units than input dimensions, so they typically create
more than 2^{D_i} linear regions.
3.4 Shallow neural networks: general case
We have described several example shallow networks to help develop intuition about how
they work. We now define a general equation for a shallow neural network y = f[x, ϕ]
that maps a multi-dimensional input x ∈ ℝ^{D_i} to a multi-dimensional output y ∈ ℝ^{D_o}
using h ∈ ℝ^{D} hidden units. Each hidden unit is computed as:

h_d = a[θ_d0 + Σ_{i=1}^{D_i} θ_di x_i],  (3.11)
and these are combined linearly to create the output:
Figure 3.9 Linear regions vs. hidden units. a) Maximum possible regions as a
function of the number of hidden units for five different input dimensions D_i =
{1, 5, 10, 50, 100}. The number of regions increases rapidly in high dimensions;
with D = 500 units and input size D_i = 100, there can be greater than 10^107
regions (solid circle). b) The same data are plotted as a function of the number of
parameters. The solid circle represents the same model as in panel (a) with D =
500 hidden units. This network has 51,001 parameters and would be considered
very small by modern standards.
Figure 3.10 Number of linear regions vs. input dimensions. a) With a single input
dimension, a model with one hidden unit creates one joint, which divides the axis
into two linear regions. b) With two input dimensions, a model with two hidden
units can divide the input space using two lines (here aligned with axes) to create
four regions. c) With three input dimensions, a model with three hidden units
can divide the input space using three planes (again aligned with axes) to create
eight regions. Continuing this argument, it follows that a model with D_i input
dimensions and D_i hidden units can divide the input space with D_i hyperplanes
to create 2^{D_i} linear regions.
Figure 3.11 Visualization of neural network with three inputs and two outputs.
This network has twenty parameters. There are fifteen slopes (indicated by
arrows) and five offsets (not shown).
y_j = ϕ_j0 + Σ_{d=1}^{D} ϕ_jd h_d,  (3.12)
where a[•] is a nonlinear activation function. The model has parameters ϕ = {θ••, ϕ••}.
Figure 3.11 shows an example with three inputs, three hidden units, and two outputs.
[Problems 3.14–3.17]
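Equations 3.11 and 3.12 vectorize naturally: collect the slopes θ into a matrix and the offsets into vectors. A sketch (the dimension names follow the text; the random parameter values are purely illustrative):

```python
import numpy as np

def shallow_net(x, theta0, Theta, phi0, Phi):
    """General shallow network (equations 3.11-3.12).
    x: (D_i,) input.  theta0: (D,), Theta: (D, D_i) -- hidden-unit parameters.
    phi0: (D_o,), Phi: (D_o, D) -- output parameters.  Returns y of shape (D_o,)."""
    h = np.maximum(0.0, theta0 + Theta @ x)  # equation 3.11 with ReLU as a[.]
    return phi0 + Phi @ h                    # equation 3.12

# dimensions of figure 3.11: three inputs, three hidden units, two outputs
D_i, D, D_o = 3, 3, 2
rng = np.random.default_rng(1)
y = shallow_net(rng.standard_normal(D_i),
                rng.standard_normal(D), rng.standard_normal((D, D_i)),
                rng.standard_normal(D_o), rng.standard_normal((D_o, D)))
```

The parameter count checks out: D(D_i + 1) hidden parameters plus D_o(D + 1) output parameters, which for this example is 3·4 + 2·4 = 20, matching figure 3.11.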
The activation function permits the model to describe nonlinear relations between
input and the output, and as such, it must be nonlinear itself; with no activation func-
tion, or a linear activation function, the overall mapping from input to output would
be restricted to be linear. Many different activation functions have been tried (see fig-
ure 3.13), but the most common choice is the ReLU (figure 3.1), which has the merit
of being easily interpretable. [Notebook 3.4: Activation functions] With ReLU activations,
the network divides the input space into convex polytopes defined by the intersections of
hyperplanes computed by the "joints" in the ReLU functions. Each convex polytope
contains a different linear function. The polytopes are the same for each output, but
the linear functions they contain can differ.
3.5 Terminology
We conclude this chapter by introducing some terminology. Regrettably, neural networks
have a lot of associated jargon. They are often referred to in terms of layers. The left of
figure 3.12 is the input layer, the center is the hidden layer, and to the right is the output
layer. We would say that the network in figure 3.12 has one hidden layer containing
four hidden units. The hidden units themselves are sometimes referred to as neurons.
When we pass data through the network, the values of the inputs to the hidden layer
(i.e., before the ReLU functions are applied) are termed pre-activations. The values at
the hidden layer (i.e., after the ReLU functions) are termed activations.
For historical reasons, any neural network with at least one hidden layer is also called
a multi-layer perceptron, or MLP for short. Networks with one hidden layer (as described
in this chapter) are sometimes referred to as shallow neural networks. Networks with
multiple hidden layers (as described in the next chapter) are referred to as deep neural
networks. Neural networks in which the connections form an acyclic graph (i.e., a graph
with no loops, as in all the examples in this chapter) are referred to as feed-forward
networks. If every element in one layer connects to every element in the next (as in
all the examples in this chapter), the network is fully connected. These connections
Draft: please send errata to udlbookmail@gmail.com.
36 3 Shallow neural networks
Figure 3.12 Terminology. A shallow network consists of an input layer, a hidden
layer, and an output layer. Each layer is connected to the next by forward con-
nections (arrows). For this reason, these models are referred to as feed-forward
networks. When every variable in one layer connects to every variable in the
next, we call this a fully connected network. Each connection represents a slope
parameter in the underlying equation, and these parameters are termed weights.
The variables in the hidden layer are termed neurons or hidden units. The values
feeding into the hidden units are termed pre-activations, and the values at the
hidden units (i.e., after the ReLU function is applied) are termed activations.
represent slope parameters in the underlying equations and are referred to as network
weights. The oset parameters (not shown in gure 3.12) are called biases.
3.6 Summary
Shallow neural networks have one hidden layer. They (i) compute several linear functions
of the input, (ii) pass each result through an activation function, and then (iii) take a
linear combination of these activations to form the outputs. Shallow neural networks
make predictions y based on inputs x by dividing the input space into a continuous
surface of piecewise linear regions. With enough hidden units (neurons), shallow neural
networks can approximate any continuous function to arbitrary precision.
Chapter 4 discusses deep neural networks, which extend the models from this chapter
by adding more hidden layers. Chapters 5–7 describe how to train these models.
Notes
“Neural” networks: If the models in this chapter are just functions, why are they called
“neural networks”? The connection is, unfortunately, tenuous. Visualizations like figure 3.12
consist of nodes (inputs, hidden units, and outputs) that are densely connected to one another.
This bears a superficial similarity to neurons in the mammalian brain, which also have dense
connections. However, there is scant evidence that brain computation works in the same way
as neural networks, and it is unhelpful to think about biology going forward.
Figure 3.13 Activation functions. a) Logistic sigmoid and tanh functions. b)
Leaky ReLU and parametric ReLU with parameter 0.25. c) SoftPlus, Gaussian
error linear unit, and sigmoid linear unit. d) Exponential linear unit with param-
eters 0.5 and 1.0. e) Scaled exponential linear unit. f) Swish with parameters 0.4,
1.0, and 1.4.
History of neural networks: McCulloch & Pitts (1943) first came up with the notion of an
artificial neuron that combined inputs to produce an output, but this model did not have a prac-
tical learning algorithm. Rosenblatt (1958) developed the perceptron, which linearly combined
inputs and then thresholded them to make a yes/no decision. He also provided an algorithm
to learn the weights from data. Minsky & Papert (1969) argued that the linear function was
inadequate for general classification problems but that adding hidden layers with nonlinear
activation functions (hence the term multi-layer perceptron) could allow the learning of more
general input/output relations. However, they concluded that Rosenblatt’s algorithm could not
learn the parameters of such models. It was not until the 1980s that a practical algorithm
(backpropagation, see chapter 7) was developed, and significant work on neural networks re-
sumed. The history of neural networks is chronicled by Kurenkov (2020), Sejnowski (2018), and
Schmidhuber (2022).
Activation functions: The ReLU function has been used as far back as Fukushima (1969).
However, in the early days of neural networks, it was more common to use the logistic sigmoid or
tanh activation functions (figure 3.13a). The ReLU was re-popularized by Jarrett et al. (2009),
Nair & Hinton (2010), and Glorot et al. (2011) and is an important part of the success story of
modern neural networks. It has the nice property that the derivative of the output with respect
to the input is always one for inputs greater than zero. This contributes to the stability and
efficiency of training (see chapter 7) and contrasts with the derivatives of sigmoid activation
functions, which saturate (become close to zero) for large positive and large negative inputs.
However, the ReLU function has the disadvantage that its derivative is zero for negative inputs.
If all the training examples produce negative inputs to a given ReLU function, then we cannot
improve the parameters feeding into this ReLU during training. The gradient with respect to
the incoming weights is locally flat, so we cannot “walk downhill.” This is known as the dying
ReLU problem. Many variations on the ReLU have been proposed to resolve this problem
(figure 3.13b), including (i) the leaky ReLU (Maas et al., 2013), which also has a linear output
for negative values with a smaller slope of 0.1, (ii) the parametric ReLU (He et al., 2015), which
treats the slope of the negative portion as an unknown parameter, and (iii) the concatenated
ReLU (Shang et al., 2016), which produces two outputs, one of which clips below zero (i.e., like
a typical ReLU) and one of which clips above zero.
A variety of smooth functions have also been investigated (figure 3.13c–d), including the soft-
plus function (Glorot et al., 2011), Gaussian error linear unit (Hendrycks & Gimpel, 2016),
sigmoid linear unit (Hendrycks & Gimpel, 2016), and exponential linear unit (Clevert et al.,
2015). Most of these are attempts to avoid the dying ReLU problem while limiting the gradient
for negative values. Klambauer et al. (2017) introduced the scaled exponential linear unit (fig-
ure 3.13e), which is particularly interesting as it helps stabilize the variance of the activations
when the input variance has a limited range (see section 7.5). Ramachandran et al. (2017)
adopted an empirical approach to choosing an activation function. They searched the space
of possible functions to find the one that performed best over a variety of supervised learning
tasks. The optimal function was found to be a[x] = x/(1 + exp[−βx]), where β is a learned
parameter (figure 3.13f). They termed this function Swish. Interestingly, this was a rediscovery
of activation functions previously proposed by Hendrycks & Gimpel (2016) and Elfwing et al.
(2018). Howard et al. (2019) approximated Swish by the HardSwish function, which has a very
similar shape but is faster to compute:
HardSwish[z] = { 0            if z < −3
               { z(z + 3)/6   if −3 ≤ z ≤ 3
               { z            if z > 3 .        (3.13)
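Both functions are straightforward to implement. A minimal sketch (β = 1 and the comparison point below are illustrative choices, not from the book):

```python
import math

def swish(x, beta=1.0):
    # Swish: a[x] = x / (1 + exp(-beta * x))
    return x / (1.0 + math.exp(-beta * x))

def hardswish(z):
    # Piecewise definition of equation 3.13
    if z < -3.0:
        return 0.0
    if z > 3.0:
        return z
    return z * (z + 3.0) / 6.0
```

The branch boundaries agree with the middle quadratic piece (at z = −3 it gives 0, at z = 3 it gives z), so HardSwish is continuous, and away from the origin it closely tracks Swish.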
There is no definitive answer as to which of these activation functions is empirically superior.
However, the leaky ReLU, parameterized ReLU, and many of the continuous functions can be
shown to provide minor performance gains over the ReLU in particular situations. We restrict
attention to neural networks with the basic ReLU function for the rest of this book because it’s
easy to characterize the functions they create in terms of the number of linear regions.
Universal approximation theorem: The width version of this theorem states that there
exists a network with one hidden layer containing a finite number of hidden units that can
approximate any specified continuous function on a compact subset of ℝ^n to arbitrary accuracy.
This was proved by Cybenko (1989) for a class of sigmoid activations and was later shown to
be true for a larger class of nonlinear activation functions (Hornik, 1991).
Number of linear regions: Consider a shallow network with D_i ≥ 2-dimensional inputs
and D hidden units. The number of linear regions is determined by the intersections of the D
hyperplanes created by the “joints” in the ReLU functions (e.g., figure 3.8d–f). Each region is
created by a different combination of the ReLU functions clipping or not clipping the input.
[Appendix B.2: Binomial coefficient] [Problem 3.18] The number of regions created by D
hyperplanes in the D_i < D-dimensional input space was shown by Zaslavsky (1975) to be at
most Σ_{j=0}^{D_i} C(D, j) (i.e., a sum of binomial coefficients). As a rule of thumb, shallow
neural networks almost always have a larger number D of hidden units than input dimensions
D_i and create between 2^{D_i} and 2^{D} linear regions.
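Zaslavsky's bound is simple to compute. A sketch (the function name is my own):

```python
from math import comb  # comb(D, j) returns 0 when j > D

def max_regions(D, D_i):
    """Maximum regions created by D hyperplanes in D_i dimensions:
    sum_{j=0}^{D_i} C(D, j)."""
    return sum(comb(D, j) for j in range(D_i + 1))
```

For D = 3 hidden units in two dimensions this gives 1 + 3 + 3 = 7, matching the seven regions of figure 3.8; when D_i ≥ D, the sum telescopes to 2^D.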
Linear, affine, and nonlinear functions: Technically, a linear transformation f[•] is any
function that obeys the principle of superposition, so f[a + b] = f[a] + f[b]. This definition implies
that f[2a] = 2f[a]. The weighted sum f[h_1, h_2, h_3] = ϕ_1 h_1 + ϕ_2 h_2 + ϕ_3 h_3 is linear, but once the
offset (bias) is added so f[h_1, h_2, h_3] = ϕ_0 + ϕ_1 h_1 + ϕ_2 h_2 + ϕ_3 h_3, this is no longer true. To see
this, consider that the output is doubled when we double the arguments of the former function.
This is not the case for the latter function, which is more properly termed an affine function.
However, it is common in machine learning to conflate these terms. We follow this convention
in this book and refer to both as linear. All other functions we will encounter are nonlinear.
Problems
Problem 3.1 What kind of mapping from input to output would be created if the activation
function in equation 3.1 was linear so that a[z] = ψ_0 + ψ_1 z? What kind of mapping would be
created if the activation function was removed, so a[z] = z?
Problem 3.2 For each of the four linear regions in figure 3.3j, indicate which hidden units are
inactive and which are active (i.e., which do and do not clip their inputs).
Problem 3.3 Derive expressions for the positions of the “joints” in the function in figure 3.3j in
terms of the ten parameters ϕ and the input x. Derive expressions for the slopes of the four
linear regions.
Problem 3.4 Draw a version of figure 3.3 where the y-intercept and slope of the third hidden
unit have changed as in figure 3.14c. Assume that the remaining parameters remain the same.
Figure 3.14 Processing in network with one input, three hidden units, and one
output for problem 3.4. a–c) The input to each hidden unit is a linear function of
the input. The first two are the same as in figure 3.3, but the last one differs.
Problem 3.5 Prove that the following property holds for α ∈ ℝ⁺:

ReLU[α · z] = α · ReLU[z].   (3.14)

This is known as the non-negative homogeneity property of the ReLU function.
Problem 3.6 Following on from problem 3.5, what happens to the shallow network defined in
equations 3.3 and 3.4 when we multiply the parameters θ_10 and θ_11 by a positive constant α
and divide the slope ϕ_1 by the same parameter α? What happens if α is negative?
Problem 3.7 Consider fitting the model in equation 3.1 using a least squares loss function. Does
this loss function have a unique minimum? i.e., is there a single “best” set of parameters?
Problem 3.8 Consider replacing the ReLU activation function with (i) the Heaviside step func-
tion heaviside[z], (ii) the hyperbolic tangent function tanh[z], and (iii) the rectangular func-
tion rect[z], where:

heaviside[z] = { 0   if z < 0
               { 1   if z ≥ 0

rect[z] = { 0   if z < 0
          { 1   if 0 ≤ z ≤ 1
          { 0   if z > 1 .        (3.15)

Redraw a version of figure 3.3 for each of these functions. The original parameters were: ϕ =
{ϕ_0, ϕ_1, ϕ_2, ϕ_3, θ_10, θ_11, θ_20, θ_21, θ_30, θ_31} = {−0.23, −1.3, 1.3, 0.66, −0.2, 0.4, −0.9, 0.9, 1.1, −0.7}.
Provide an informal description of the family of functions that can be created by neural networks
with one input, three hidden units, and one output for each activation function.
Problem 3.9 Show that the third linear region in figure 3.3 has a slope that is the sum of the
slopes of the first and fourth linear regions.
Problem 3.10 Consider a neural network with one input, one output, and three hidden units.
The construction in figure 3.3 shows how this creates four linear regions. Under what circum-
stances could this network produce a function with fewer than four linear regions?
Problem 3.11 How many parameters does the model in figure 3.6 have?

Problem 3.12 How many parameters does the model in figure 3.7 have?

Problem 3.13 What is the activation pattern for each of the seven regions in figure 3.8? In other
words, which hidden units are active (pass the input) and which are inactive (clip the input)
for each region?
Problem 3.14 Write out the equations that define the network in figure 3.11. There should
be three equations to compute the three hidden units from the inputs and two equations to
compute the outputs from the hidden units.

Problem 3.15 What is the maximum possible number of 3D linear regions that can be created
by the network in figure 3.11?
Problem 3.16 Write out the equations for a network with two inputs, four hidden units, and
three outputs. Draw this model in the style of figure 3.11.
Problem 3.17 Equations 3.11 and 3.12 define a general neural network with D_i inputs, one
hidden layer containing D hidden units, and D_o outputs. Find an expression for the number of
parameters in the model in terms of D_i, D, and D_o.
Problem 3.18 Show that the maximum number of regions created by a shallow network
with D_i = 2-dimensional input, D_o = 1-dimensional output, and D = 3 hidden units is seven, as
in figure 3.8j. Use the result of Zaslavsky (1975) that the maximum number of regions created
by partitioning a D_i-dimensional space with D hyperplanes is Σ_{j=0}^{D_i} C(D, j). What is the
maximum number of regions if we add two more hidden units to this model, so D = 5?
Chapter 4
Deep neural networks
The last chapter described shallow neural networks, which have a single hidden layer.
This chapter introduces deep neural networks, which have more than one hidden layer.
With ReLU activation functions, both shallow and deep networks describe piecewise
linear mappings from input to output.
As the number of hidden units increases, shallow neural networks improve their
descriptive power. Indeed, with enough hidden units, shallow networks can describe
arbitrarily complex functions in high dimensions. However, it turns out that for some
functions, the required number of hidden units is impractically large. Deep networks can
produce many more linear regions than shallow networks for a given number of parameters.
Hence, from a practical standpoint, they can be used to describe a broader family
of functions.
4.1 Composing neural networks
To gain insight into the behavior of deep neural networks, we first consider composing
two shallow networks so the output of the first becomes the input of the second. Consider
two shallow networks with three hidden units each (figure 4.1a). The first network takes
an input x and returns output y and is defined by:
h_1 = a[θ_10 + θ_11 x]
h_2 = a[θ_20 + θ_21 x]
h_3 = a[θ_30 + θ_31 x],   (4.1)

and

y = ϕ_0 + ϕ_1 h_1 + ϕ_2 h_2 + ϕ_3 h_3.   (4.2)

The second network takes y as input and returns y′ and is defined by:
Figure 4.1 Composing two single-layer networks with three hidden units each. a)
The output y of the first network constitutes the input to the second network. b)
The first network maps inputs x ∈ [−1, 1] to outputs y ∈ [−1, 1] using a function
comprising three linear regions that are chosen so that they alternate the sign
of their slope (the fourth linear region is outside the range of the graph). Multiple
inputs x (gray circles) now map to the same output y (cyan circle). c) The second
network defines a function comprising three linear regions that takes y and returns y′
(i.e., the cyan circle is mapped to the brown circle). d) The combined effect of these
two functions when composed is that (i) three different inputs x are mapped to
any given value of y by the first network and (ii) are processed in the same way by
the second network; the result is that the function defined by the second network
in panel (c) is duplicated three times, variously flipped and rescaled according to
the slope of the regions of panel (b).
h′_1 = a[θ′_10 + θ′_11 y]
h′_2 = a[θ′_20 + θ′_21 y]
h′_3 = a[θ′_30 + θ′_31 y],   (4.3)

and

y′ = ϕ′_0 + ϕ′_1 h′_1 + ϕ′_2 h′_2 + ϕ′_3 h′_3.   (4.4)
With ReLU activations, this model also describes a family of piecewise linear functions.
However, the number of linear regions is potentially greater than for a shallow network
with six hidden units. To see this, consider choosing the first network to produce three
alternating regions of positive and negative slope (figure 4.1b). [Problem 4.1] This means
that three different ranges of x are mapped to the same output range y ∈ [−1, 1], and the
subsequent mapping from this range of y to y′ is applied three times. [Notebook 4.1:
Composing networks] The overall effect is that the function defined by the second network
is duplicated three times to create nine linear regions. The same principle applies in
higher dimensions (figure 4.2).
A different way to think about composing networks is that the first network “folds”
the input space x back onto itself so that multiple inputs generate the same output.
Then the second network applies a function, which is replicated at all points that were
folded on top of one another (figure 4.3).
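The folding picture can be made concrete in code. The sketch below (parameters chosen by hand for illustration; they are not the book's values) builds a first network whose three alternating-slope regions map three distinct inputs to the same y, so the composed network necessarily treats those inputs identically.

```python
def relu(z):
    return max(0.0, z)

def net1(x):
    # Maps [-1, 1] onto [-1, 1] with joints at x = -1, -1/3, +1/3 and
    # region slopes +3, -3, +3 (alternating sign, as in figure 4.1b).
    return -1 + 3 * relu(x + 1) - 6 * relu(x + 1/3) + 6 * relu(x - 1/3)

def net2(y):
    # A second shallow network applied to y (illustrative parameters).
    return 0.2 + 1.5 * relu(y + 0.4) - 2.0 * relu(y) + 1.0 * relu(y - 0.6)

def composed(x):
    return net2(net1(x))

# These three inputs all "fold" onto the same value y = 0.5:
xs = [-0.5, -1/6, 5/6]
```

Since net1 maps all three inputs to the same y, composed(x) is identical at all three points; repeating this for every y in [−1, 1] duplicates net2's function three times, giving up to nine linear regions.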
4.2 From composing networks to deep networks
The previous section showed that we could create complex functions by passing the
output of one shallow neural network into a second network. We now show that this is
a special case of a deep network with two hidden layers.
The output of the first network (y = ϕ_0 + ϕ_1 h_1 + ϕ_2 h_2 + ϕ_3 h_3) is a linear combina-
tion of the activations at the hidden units. The first operations of the second network
(equation 4.3 in which we calculate θ′_10 + θ′_11 y, θ′_20 + θ′_21 y, and θ′_30 + θ′_31 y) are linear in
the output of the first network. Applying one linear function to another yields another
linear function. Substituting the expression for y into equation 4.3 gives:
h′_1 = a[θ′_10 + θ′_11 y] = a[θ′_10 + θ′_11 ϕ_0 + θ′_11 ϕ_1 h_1 + θ′_11 ϕ_2 h_2 + θ′_11 ϕ_3 h_3]
h′_2 = a[θ′_20 + θ′_21 y] = a[θ′_20 + θ′_21 ϕ_0 + θ′_21 ϕ_1 h_1 + θ′_21 ϕ_2 h_2 + θ′_21 ϕ_3 h_3]
h′_3 = a[θ′_30 + θ′_31 y] = a[θ′_30 + θ′_31 ϕ_0 + θ′_31 ϕ_1 h_1 + θ′_31 ϕ_2 h_2 + θ′_31 ϕ_3 h_3],   (4.5)
which we can rewrite as:
h′_1 = a[ψ_10 + ψ_11 h_1 + ψ_12 h_2 + ψ_13 h_3]
h′_2 = a[ψ_20 + ψ_21 h_1 + ψ_22 h_2 + ψ_23 h_3]
h′_3 = a[ψ_30 + ψ_31 h_1 + ψ_32 h_2 + ψ_33 h_3],   (4.6)
Figure 4.2 Composing neural networks with a 2D input. a) The first network
(from figure 3.8) has three hidden units and takes two inputs x_1 and x_2 and returns
a scalar output y. This is passed into a second network with two hidden units to
produce y′. b) The first network produces a function consisting of seven linear
regions, one of which is flat. c) The second network defines a function comprising
two linear regions in y ∈ [−1, 1]. d) When these networks are composed, each of
the six non-flat regions from the first network is divided into two new regions by
the second network to create a total of 13 linear regions.
Figure 4.3 Deep networks as folding input space. a) One way to think about
the first network from figure 4.1 is that it “folds” the input space back on top
of itself. b) The second network applies its function to the folded space. c) The
final output is revealed by “unfolding” again.
Figure 4.4 Neural network with one input, one output, and two hidden layers,
each containing three hidden units.
where ψ_10 = θ′_10 + θ′_11 ϕ_0, ψ_11 = θ′_11 ϕ_1, ψ_12 = θ′_11 ϕ_2, and so on. The result is a network
with two hidden layers (figure 4.4).

It follows that a network with two layers can represent the family of functions created
by passing the output of one single-layer network into another. In fact, it represents a
broader family because in equation 4.6, the nine slope parameters ψ_11, ψ_21, . . . , ψ_33 can
take arbitrary values, whereas, in equation 4.5, these parameters are constrained to be
the outer product [θ′_11, θ′_21, θ′_31]ᵀ [ϕ_1, ϕ_2, ϕ_3].
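This identification can be checked numerically. The following sketch (all parameters invented for illustration) evaluates the composed pair of shallow networks (equations 4.1–4.4) and the equivalent two-layer network of equation 4.6 with ψ set from the outer-product constraint, and they agree at every input.

```python
def relu(z):
    return max(0.0, z)

theta   = [(-0.2, 0.4), (-0.9, 0.9), (1.1, -0.7)]   # first network
phi     = [-0.23, -1.3, 1.3, 0.66]
theta_p = [(0.5, -1.0), (-0.2, 0.8), (0.3, 0.5)]    # second network (theta')
phi_p   = [0.1, 0.7, -0.4, 0.9]                     # (phi')

def composed(x):
    h = [relu(t0 + t1 * x) for t0, t1 in theta]                    # eq. 4.1
    y = phi[0] + sum(p * hd for p, hd in zip(phi[1:], h))          # eq. 4.2
    hp = [relu(t0 + t1 * y) for t0, t1 in theta_p]                 # eq. 4.3
    return phi_p[0] + sum(p * hd for p, hd in zip(phi_p[1:], hp))  # eq. 4.4

def two_layer(x):
    h = [relu(t0 + t1 * x) for t0, t1 in theta]
    # psi_k0 = theta'_k0 + theta'_k1 phi_0; psi_kd = theta'_k1 phi_d
    hp = [relu(t0 + t1 * phi[0]
               + sum(t1 * p * hd for p, hd in zip(phi[1:], h)))
          for t0, t1 in theta_p]                                   # eq. 4.6
    return phi_p[0] + sum(p * hd for p, hd in zip(phi_p[1:], hp))
```

Choosing the nine ψ slopes freely instead of from this rank-one constraint is exactly what makes the general two-layer network a broader family.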
4.3 Deep neural networks
In the previous section, we showed that composing two shallow networks yields a special
case of a deep network with two hidden layers. Now we consider the general case of a
deep network with two hidden layers, each containing three hidden units (figure 4.4).
The first layer is defined by:
h_1 = a[θ_10 + θ_11 x]
h_2 = a[θ_20 + θ_21 x]
h_3 = a[θ_30 + θ_31 x],   (4.7)

the second layer by:

h′_1 = a[ψ_10 + ψ_11 h_1 + ψ_12 h_2 + ψ_13 h_3]
h′_2 = a[ψ_20 + ψ_21 h_1 + ψ_22 h_2 + ψ_23 h_3]
h′_3 = a[ψ_30 + ψ_31 h_1 + ψ_32 h_2 + ψ_33 h_3],   (4.8)

and the output by:

y′ = ϕ′_0 + ϕ′_1 h′_1 + ϕ′_2 h′_2 + ϕ′_3 h′_3.   (4.9)
Considering these equations leads to another way to think about how the network con-
structs an increasingly complicated function (figure 4.5): [Notebook 4.2: Clipping functions]

1. The three hidden units h_1, h_2, and h_3 in the first layer are computed as usual by
forming linear functions of the input and passing these through ReLU activation
functions (equation 4.7).
2. The pre-activations at the second layer are computed by taking three new linear
functions of these hidden units (arguments of the activation functions in equa-
tion 4.8). At this point, we effectively have a shallow network with three outputs;
we have computed three piecewise linear functions with the “joints” between linear
regions in the same places (see figure 3.6).
3. At the second hidden layer, another ReLU function a[•] is applied to each function
(equation 4.8), which clips them and adds new “joints” to each.
4. The final output is a linear combination of these hidden units (equation 4.9).
In conclusion, we can either think of each layer as “folding” the input space or as cre-
ating new functions, which are clipped (creating new regions) and then recombined. The
former view emphasizes the dependencies in the output function but not how clipping
creates new joints, and the latter has the opposite emphasis. Ultimately, both descrip-
tions provide only partial insight into how deep neural networks operate. Regardless,
it’s important not to lose sight of the fact that this is still merely an equation relating
input x to output y′. Indeed, we can combine equations 4.7–4.9 to get one expression:
y′ = ϕ′_0 + ϕ′_1 a[ψ_10 + ψ_11 a[θ_10 + θ_11 x] + ψ_12 a[θ_20 + θ_21 x] + ψ_13 a[θ_30 + θ_31 x]]
        + ϕ′_2 a[ψ_20 + ψ_21 a[θ_10 + θ_11 x] + ψ_22 a[θ_20 + θ_21 x] + ψ_23 a[θ_30 + θ_31 x]]
        + ϕ′_3 a[ψ_30 + ψ_31 a[θ_10 + θ_11 x] + ψ_32 a[θ_20 + θ_21 x] + ψ_33 a[θ_30 + θ_31 x]],   (4.10)

although this is admittedly rather difficult to understand.
4.3.1 Hyperparameters
We can extend the deep network construction to more than two hidden layers; modern
networks might have more than a hundred layers with thousands of hidden units at each
layer. The number of hidden units in each layer is referred to as the width of the network,
and the number of hidden layers as the depth. The total number of hidden units is a
measure of the network’s capacity.
We denote the number of layers as K and the number of hidden units in each layer
as D_1, D_2, . . . , D_K. These are examples of hyperparameters. They are quantities chosen
before we learn the model parameters (i.e., the slope and intercept terms). [Problem 4.2]
For fixed hyperparameters (e.g., K = 2 layers with D_k = 3 hidden units in each), the model
describes a family of functions, and the parameters determine the particular function.
Hence, when we also consider the hyperparameters, we can think of neural networks as
representing a family of families of functions relating input to output.
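The hyperparameters fix the network's shape, and the parameter count follows directly from it: each layer contributes one weight per incoming connection plus one bias per unit. A sketch (the function name is my own):

```python
def num_parameters(D_i, widths, D_o):
    """Parameter count for a fully connected network with input size D_i,
    hidden widths D_1..D_K, and output size D_o. Each layer contributes
    (fan_in + 1) * fan_out numbers (weights plus biases)."""
    sizes = [D_i] + list(widths) + [D_o]
    return sum((fan_in + 1) * fan_out
               for fan_in, fan_out in zip(sizes[:-1], sizes[1:]))

# The network of figure 3.11 (3 inputs, one layer of 3 units, 2 outputs):
n_fig_3_11 = num_parameters(3, [3], 2)   # twenty parameters
```

The same function reproduces the highlighted points of figure 4.7: K = 5 layers of D = 10 units with scalar input/output gives 471 parameters, and K = 5 layers of D = 50 units with ten inputs gives 10,801.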
Figure 4.5 Computation for the deep network in figure 4.4. a–c) The inputs
to the second hidden layer (i.e., the pre-activations) are three piecewise linear
functions where the “joints” between the linear regions are at the same places
(see figure 3.6). d–f) Each piecewise linear function is clipped to zero by the
ReLU activation function. g–i) These clipped functions are then weighted with
parameters ϕ′_1, ϕ′_2, and ϕ′_3, respectively. j) Finally, the clipped and weighted
functions are summed and an offset ϕ′_0 that controls the overall height is added.
Figure 4.6 Matrix notation for network with D_i = 3-dimensional input x, D_o = 2-
dimensional output y, and K = 3 hidden layers h_1, h_2, and h_3 of dimensions
D_1 = 4, D_2 = 2, and D_3 = 3, respectively. The weights are stored in matrices Ω_k
that pre-multiply the activations from the preceding layer to create the pre-
activations at the subsequent layer. For example, the weight matrix Ω_1 that
computes the pre-activations at h_2 from the activations at h_1 has dimension
2 × 4. It is applied to the four hidden units in layer one and creates the inputs to
the two hidden units at layer two. The biases are stored in vectors β_k and have
the dimension of the layer into which they feed. For example, the bias vector β_2
is length three because layer h_3 contains three hidden units.
4.4 Matrix notation
We have seen that a deep neural network consists of linear transformations alternating
with activation functions. [Appendix B.3: Matrices] We could equivalently describe
equations 4.7–4.9 in matrix notation as:
\[
\begin{bmatrix} h_1 \\ h_2 \\ h_3 \end{bmatrix}
= a\left[\begin{bmatrix} \theta_{10} \\ \theta_{20} \\ \theta_{30} \end{bmatrix}
+ \begin{bmatrix} \theta_{11} \\ \theta_{21} \\ \theta_{31} \end{bmatrix} x \right],
\tag{4.11}
\]

\[
\begin{bmatrix} h'_1 \\ h'_2 \\ h'_3 \end{bmatrix}
= a\left[\begin{bmatrix} \psi_{10} \\ \psi_{20} \\ \psi_{30} \end{bmatrix}
+ \begin{bmatrix} \psi_{11} & \psi_{12} & \psi_{13} \\
\psi_{21} & \psi_{22} & \psi_{23} \\
\psi_{31} & \psi_{32} & \psi_{33} \end{bmatrix}
\begin{bmatrix} h_1 \\ h_2 \\ h_3 \end{bmatrix}\right],
\tag{4.12}
\]

and

\[
y' = \phi'_0 + \begin{bmatrix} \phi'_1 & \phi'_2 & \phi'_3 \end{bmatrix}
\begin{bmatrix} h'_1 \\ h'_2 \\ h'_3 \end{bmatrix},
\tag{4.13}
\]
or even more compactly in matrix notation as:
h = a[θ_0 + θx]
h′ = a[ψ_0 + Ψh]
y′ = ϕ′_0 + ϕ′h′,   (4.14)

where, in each case, the function a[•] applies the activation function separately to every
element of its vector input.
4.4.1 General formulation
This notation becomes cumbersome for networks with many layers. Hence, from now
on, we will describe the vector of hidden units at layer k as h_k, the vector of biases
(intercepts) that contribute to hidden layer k+1 as β_k, and the weights (slopes) that
are applied to the kth layer and contribute to the (k+1)th layer as Ω_k. A general deep
network y = f[x, ϕ] with K layers can now be written as:
h_1 = a[β_0 + Ω_0 x]
h_2 = a[β_1 + Ω_1 h_1]
h_3 = a[β_2 + Ω_2 h_2]
  ⋮
h_K = a[β_{K−1} + Ω_{K−1} h_{K−1}]
y = β_K + Ω_K h_K.   (4.15)
The parameters ϕ of this model comprise all of these weight matrices and bias vectors:
ϕ = {β_k, Ω_k}_{k=0}^{K}.

If the kth layer has D_k hidden units, then the bias vector β_{k−1} will be of size D_k.
The last bias vector β_K has the size D_o of the output. The first weight matrix Ω_0 has
size D_1 × D_i, where D_i is the size of the input. [Notebook 4.3: Deep networks] The last
weight matrix Ω_K is D_o × D_K, and the remaining matrices Ω_k are D_{k+1} × D_k (figure 4.6).
We can equivalently write the network as a single function: [Problems 4.3–4.6]

y = β_K + Ω_K a[β_{K−1} + Ω_{K−1} a[. . . β_2 + Ω_2 a[β_1 + Ω_1 a[β_0 + Ω_0 x]] . . .]].   (4.16)
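Equation 4.15 transcribes almost directly into code. The following plain-Python sketch (no matrix library; the 0.1 weights and zero biases are illustrative values, not from the book) runs a forward pass with the dimensions of figure 4.6.

```python
def relu(z):
    return max(0.0, z)

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def forward(x, Omegas, betas):
    """Omegas[k] is Omega_k (a list of rows, shape D_{k+1} x D_k);
    betas[k] is beta_k. ReLU after every layer except the last (eq. 4.15)."""
    h = x
    for k, (Omega, beta) in enumerate(zip(Omegas, betas)):
        pre = [b + p for b, p in zip(beta, matvec(Omega, h))]
        h = pre if k == len(Omegas) - 1 else [relu(p) for p in pre]
    return h

# Shapes from figure 4.6: D_i = 3, D_1 = 4, D_2 = 2, D_3 = 3, D_o = 2.
Omegas = [[[0.1] * 3 for _ in range(4)],   # Omega_0: 4 x 3
          [[0.1] * 4 for _ in range(2)],   # Omega_1: 2 x 4
          [[0.1] * 2 for _ in range(3)],   # Omega_2: 3 x 2
          [[0.1] * 3 for _ in range(2)]]   # Omega_3: 2 x 3
betas = [[0.0] * 4, [0.0] * 2, [0.0] * 3, [0.0] * 2]
y = forward([1.0, 1.0, 1.0], Omegas, betas)   # two outputs
```

Note that each Ω_k has one row per unit in the layer it feeds and one column per unit in the layer it reads from, matching the D_{k+1} × D_k convention above.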
4.5 Shallow vs. deep neural networks
Chapter 3 discussed shallow networks (with a single hidden layer), and here we have
described deep networks (with multiple hidden layers). We now compare these models.
4.5.1 Ability to approximate different functions
In section 3.2, we argued that shallow neural networks with enough capacity (hidden
units) could model any continuous function arbitrarily closely. In this chapter, we saw
that a deep network with two hidden layers could represent the composition of two
shallow networks. If the second of these networks computes the identity function, then
this deep network replicates a single shallow network. Hence, it can also approximate
any continuous function arbitrarily closely given sufficient capacity. [Problem 4.7]
4.5.2 Number of linear regions per parameter
A shallow network with one input, one output, and D > 2 hidden units can create up
to D + 1 linear regions and is defined by 3D + 1 parameters. [Problems 4.8–4.11] A deep
network with one input, one output, and K layers of D > 2 hidden units can create a
function with up to (D + 1)^K linear regions using 3D + 1 + (K − 1)D(D + 1) parameters.
Figure 4.7a shows how the maximum number of linear regions increases as a function
of the number of parameters for networks mapping scalar input x to scalar output y.
Deep neural networks create much more complex functions for a fixed parameter budget.
This effect is magnified as the number of input dimensions D_i increases (figure 4.7b),
although computing the maximum number of regions is less straightforward.
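These counting formulas are easy to verify numerically. The sketch below (function names are my own) reproduces the highlighted point of figure 4.7a: K = 5 layers of D = 10 units gives 471 parameters and 161,051 regions, whereas a shallow network with a comparable budget (D = 156 units, 469 parameters) creates only 157 regions.

```python
def shallow_regions(D):
    """Max linear regions for one layer of D units (1D input/output)."""
    return D + 1

def deep_regions(D, K):
    """Max linear regions for K layers of D units each (1D input/output)."""
    return (D + 1) ** K

def deep_params(D, K):
    """Parameter count 3D + 1 + (K - 1) D (D + 1) for the same network."""
    return 3 * D + 1 + (K - 1) * D * (D + 1)
```

Setting K = 1 reduces deep_params to the shallow count 3D + 1, as expected.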
This seems attractive, but the flexibility of the functions is still limited by the number
of parameters. Deep networks can create extremely large numbers of linear regions, but
these contain complex dependencies and symmetries. We saw some of these when we
considered deep networks as “folding” the input space (figure 4.3). So, it’s not clear that
the greater number of regions is an advantage unless (i) there are similar symmetries in
the real-world functions that we wish to approximate or (ii) we have reason to believe
that the mapping from input to output really does involve a composition of simpler
functions.
4.5.3 Depth efficiency

Both deep and shallow networks can model arbitrary functions, but some functions
can be approximated much more efficiently with deep networks. Functions have been
identified that require a shallow network with exponentially more hidden units to achieve
an equivalent approximation to that of a deep network. This phenomenon is referred to
as the depth efficiency of neural networks. This property is also attractive, but it’s not
clear that the real-world functions that we want to approximate fall into this category.
4.5.4 Large, structured inputs
We have discussed fully connected networks where every element of each layer contributes
to every element of the subsequent one. However, these are not practical for large,
structured inputs like images, where the input might comprise 10^6 pixels. The number
of parameters would be prohibitive, and moreover, we want different parts of the image
to be processed similarly; there is no point in independently learning to recognize the
same object at every possible position in the image.

Figure 4.7 The maximum number of linear regions for neural networks increases
rapidly with the network depth. a) Network with D_i = 1 input. Each curve rep-
resents a fixed number of hidden layers K, as we vary the number of hidden units
D per layer. For a fixed parameter budget (horizontal position), deeper networks
produce more linear regions than shallower ones. A network with K = 5 layers
and D = 10 hidden units per layer has 471 parameters (highlighted point) and
can produce 161,051 regions. b) Network with D_i = 10 inputs. Each subsequent
point along a curve represents ten hidden units. Here, a model with K = 5 layers
and D = 50 hidden units per layer has 10,801 parameters (highlighted point) and
can create more than 10^40 linear regions.
The solution is to process local image regions in parallel and then gradually integrate
information from increasingly large regions. This kind of local-to-global processing is
dicult to specify without using multiple layers (see chapter 10).
4.5.5 Training and generalization
A further possible advantage of deep networks over shallow networks is their ease of fitting; it is usually easier to train moderately deep networks than to train shallow ones (see figure 20.2). It may be that over-parameterized deep models (i.e., those with more parameters than training examples) have a large family of roughly equivalent solutions that are easy to find. However, as we add more hidden layers, training becomes more difficult again. Many methods have been developed to mitigate this problem (see chapter 11).
Deep neural networks also seem to generalize to new data better than shallow ones.
In practice, the best results for most tasks have been achieved using networks with tens
or hundreds of layers. Neither of these phenomena are well understood, and we return
to them in chapter 20.
Draft: please send errata to udlbookmail@gmail.com.
4.6 Summary
In this chapter, we rst considered what happens when we compose two shallow networks.
We argued that the rst network “folds” the input space, and the second network then
applies a piecewise linear function. The eects of the second network are duplicated
where the input space is folded onto itself.
We then showed that this composition of shallow networks is a special case of a deep
network with two layers. We interpreted the ReLU functions in each layer as clipping
the input functions in multiple places and creating more “joints” in the output function.
We introduced the idea of hyperparameters, which for the networks we’ve seen so far,
comprise the number of hidden layers and the number of hidden units in each.
Finally, we compared shallow and deep networks. We saw that (i) both networks
can approximate any function given enough capacity, (ii) deep networks produce many
more linear regions per parameter, (iii) some functions can be approximated much more efficiently by deep networks, (iv) large, structured inputs like images are best processed
in multiple stages, and (v) in practice, the best results for most tasks are achieved using
deep networks with many layers.
Now that we understand deep and shallow network models, we turn our attention to
training them. In the next chapter, we discuss loss functions. For any given parameter
values ϕ, the loss function returns a single number that indicates the mismatch between
the model outputs and the ground truth predictions for a training dataset. In chapters 6
and 7, we deal with the training process itself, in which we seek the parameter values
that minimize this loss.
Notes
Deep learning: It has long been understood that it is possible to build more complex functions
by composing shallow neural networks or developing networks with more than one hidden layer.
Indeed, the term "deep learning" was first used by Dechter (1986). However, interest was limited due to practical concerns; it was not possible to train such networks well. The modern era of deep learning was kick-started by startling improvements in image classification reported by Krizhevsky et al. (2012). This sudden progress was arguably due to the confluence of four
factors: larger training datasets, improved processing power for training, the use of the ReLU
activation function, and the use of stochastic gradient descent (see chapter 6). LeCun et al.
(2015) present an overview of early advances in the modern era of deep learning.
Number of linear regions: For deep networks using a total of D hidden units with ReLU activations, the upper bound on the number of regions is $2^D$ (Montúfar et al., 2014). The same authors show that a deep ReLU network with $D_i$-dimensional input and $K$ layers, each containing $D \geq D_i$ hidden units, has $O\left[(D/D_i)^{(K-1)D_i} D^{D_i}\right]$ linear regions. Montúfar (2017), Arora et al. (2016), and Serra et al. (2018) all provide tighter upper bounds that consider the possibility that each layer has different numbers of hidden units. Serra et al. (2018) provide an algorithm that counts the number of linear regions in a neural network, although it is only practical for very small networks.
If the number of hidden units $D$ in each of the $K$ layers is the same, and $D$ is an integer multiple of the input dimensionality $D_i$, then the maximum number of linear regions $N_r$ can be computed exactly and is:

$$N_r = \left(\frac{D}{D_i} + 1\right)^{D_i(K-1)} \cdot \sum_{j=0}^{D_i} \binom{D}{j}. \tag{4.17}$$

The first term in this expression corresponds to the first $K-1$ layers of the network, which can be thought of as repeatedly folding the input space. However, we now need to devote $D/D_i$ hidden units to each input dimension to create these folds. The last term in this equation (a sum of binomial coefficients; see Appendix B.2) is the number of regions that a shallow network can create and is attributable to the last layer. For further information, consult Montúfar et al. (2014), Pascanu et al. (2013), and Montúfar (2017).
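Equation 4.17 is easy to check numerically. The sketch below reproduces the two highlighted points in figure 4.7 (the parameter count assumes a single output unit, which matches the numbers quoted in that figure):

```python
from math import comb

def num_parameters(Di, D, K, Do=1):
    """Parameters of a fully connected net: Di inputs, K hidden layers of D units, Do outputs."""
    total = Di * D + D                       # first hidden layer (weights + biases)
    total += (K - 1) * (D * D + D)           # remaining K-1 hidden layers
    total += D * Do + Do                     # output layer
    return total

def max_linear_regions(Di, D, K):
    """Maximum number of linear regions (equation 4.17); assumes D is an
    integer multiple of Di."""
    folds = (D // Di + 1) ** (Di * (K - 1))          # first K-1 layers fold input space
    shallow = sum(comb(D, j) for j in range(Di + 1))  # last layer acts as a shallow net
    return folds * shallow

# Reproduce the highlighted points in figure 4.7:
print(num_parameters(Di=1, D=10, K=5))       # 471
print(max_linear_regions(Di=1, D=10, K=5))   # 161051
print(num_parameters(Di=10, D=50, K=5))      # 10801
```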
Universal approximation theorem: We argued in section 4.5.1 that if the layers of a deep network have enough hidden units, then the width version of the universal approximation theorem applies: there exists a network that can approximate any given continuous function on a compact subset of $\mathbb{R}^{D_i}$ to arbitrary accuracy. Lu et al. (2017) proved that there exists a network with ReLU activation functions and at least $D_i + 4$ hidden units in each layer that can approximate any specified $D_i$-dimensional Lebesgue integrable function to arbitrary accuracy given enough layers. This is known as the depth version of the universal approximation theorem.
Depth eciency: Several results show that there are functions that can be realized by deep
networks but not by any shallow network whose capacity is bounded above exponentially. In
other words, it would take an exponentially larger number of units in a shallow network to
describe these functions accurately. This is known as the depth eciency of neural networks.
Telgarsky (2016) shows that for any integer k, it is possible to construct networks with one input,
one output, and O[k
3
] layers of constant width, which cannot be realized with O[k] layers and
less than 2
k
width. Perhaps surprisingly, Eldan & Shamir (2016) showed that when there are
multivariate inputs, there is a three-layer network that cannot be realized by any two-layer
network if the capacity is sub-exponential in the input dimension. Cohen et al. (2016), Safran
& Shamir (2017), and Poggio et al. (2017) also demonstrate functions that deep networks can
approximate eciently, but shallow ones cannot. Liang & Srikant (2016) show that for a broad
class of functions, including univariate functions, shallow networks require exponentially more
hidden units than deep networks for a given upper bound on the approximation error.
Width eciency: Lu et al. (2017) investigate whether there are wide shallow networks (i.e.,
shallow networks with lots of hidden units) that cannot be realized by narrow networks whose
depth is not substantially larger. They show that there exist classes of wide, shallow networks
that can only be expressed by narrow networks with polynomial depth. This is known as the
width eciency of neural networks. This polynomial lower bound on width is less restrictive
than the exponential lower bound on depth, suggesting that depth is more important. Vardi
et al. (2022) subsequently showed that the price for making the width small is only a linear
increase in the network depth for networks with ReLU activations.
Problems
Problem 4.1 Consider composing the two neural networks in figure 4.8. Draw a plot of the relationship between the input x and output y′ for x ∈ [−1, 1].
Problem 4.2 Identify the four hyperparameters in figure 4.6.
Figure 4.8 Composition of two networks for problem 4.1. a) The output y of the first network becomes the input to the second. b) The first network computes this function with output values y ∈ [−1, 1]. c) The second network computes this function on the input range y ∈ [−1, 1].

Problem 4.3 Using the non-negative homogeneity property of the ReLU function (see problem 3.5), show that:

$$\text{ReLU}\bigl[\beta_1 + \lambda_1 \cdot \Omega_1\,\text{ReLU}[\beta_0 + \lambda_0 \cdot \Omega_0 x]\bigr] = \lambda_0\lambda_1 \cdot \text{ReLU}\left[\frac{1}{\lambda_0\lambda_1}\beta_1 + \Omega_1\,\text{ReLU}\left[\frac{1}{\lambda_0}\beta_0 + \Omega_0 x\right]\right], \tag{4.18}$$

where λ_0 and λ_1 are non-negative scalars. From this, we see that the weight matrices can be rescaled by any magnitude as long as the biases are also adjusted, and the scale factors can be re-applied at the end of the network.
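The identity in equation 4.18 can be sanity-checked numerically (this does not replace the proof the problem asks for; the matrix sizes, random values, and scale factors below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
Omega0, Omega1 = rng.standard_normal((3, 2)), rng.standard_normal((1, 3))
beta0, beta1 = rng.standard_normal((3, 1)), rng.standard_normal((1, 1))
x = rng.standard_normal((2, 1))
lam0, lam1 = 0.7, 2.5   # non-negative scalars

relu = lambda z: np.maximum(z, 0)

# Left- and right-hand sides of equation 4.18
lhs = relu(beta1 + lam1 * Omega1 @ relu(beta0 + lam0 * Omega0 @ x))
rhs = lam0 * lam1 * relu(beta1 / (lam0 * lam1)
                         + Omega1 @ relu(beta0 / lam0 + Omega0 @ x))
print(np.allclose(lhs, rhs))  # True
```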
Problem 4.4 Write out the equations for a deep neural network that takes D_i = 5 inputs, D_o = 4 outputs and has three hidden layers of sizes D_1 = 20, D_2 = 10, and D_3 = 7, respectively, in both the forms of equations 4.15 and 4.16. What are the sizes of each weight matrix Ω_• and bias vector β_•?
Problem 4.5 Consider a deep neural network with D_i = 5 inputs, D_o = 1 output, and K = 20 hidden layers containing D = 30 hidden units each. What is the depth of this network? What is the width?
Problem 4.6 Consider a network with D_i = 1 input, D_o = 1 output, and K = 10 layers, with D = 10 hidden units in each. Would the number of weights increase more if we increased the depth by one or the width by one? Provide your reasoning.
Problem 4.7 Choose values for the parameters ϕ = {ϕ_0, ϕ_1, ϕ_2, ϕ_3, θ_10, θ_11, θ_20, θ_21, θ_30, θ_31} for the shallow neural network in equation 3.1 (with ReLU activation functions) that will define an identity function over a finite range x ∈ [a, b].
Problem 4.8 Figure 4.9 shows the activations in the three hidden units of a shallow network (as in figure 3.3). The slopes in the hidden units are 1.0, 1.0, and −1.0, respectively, and the "joints" in the hidden units are at positions 1/6, 2/6, and 4/6. Find values of ϕ_0, ϕ_1, ϕ_2, and ϕ_3 that will combine the hidden unit activations as ϕ_0 + ϕ_1 h_1 + ϕ_2 h_2 + ϕ_3 h_3 to create a function with four linear regions that oscillate between output values of zero and one. The slope of the leftmost region should be positive, the next one negative, and so on. How many linear regions will we create if we compose this network with itself? How many will we create if we compose it with itself K times?
Problem 4.9 Following problem 4.8, is it possible to create a function with three linear regions that oscillates back and forth between output values of zero and one using a shallow network with two hidden units? Is it possible to create a function with five linear regions that oscillates in the same way using a shallow network with four hidden units?
Figure 4.9 Hidden unit activations for problem 4.8. a) First hidden unit has a
joint at position x = 1/6 and a slope of one in the active region. b) Second hidden
unit has a joint at position x = 2/6 and a slope of one in the active region. c)
Third hidden unit has a joint at position x = 4/6 and a slope of minus one in the
active region.
Problem 4.10 Consider a deep neural network with a single input, a single output, and K hidden layers, each of which contains D hidden units. Show that this network will have a total of 3D + 1 + (K − 1)D(D + 1) parameters.
Problem 4.11 Consider two neural networks that map a scalar input x to a scalar output y. The first network is shallow and has D = 95 hidden units. The second is deep and has K = 10 layers, each containing D = 5 hidden units. How many parameters does each network have? How many linear regions can each network make (see equation 4.17)? Which would run faster?
Chapter 5
Loss functions
The last three chapters described linear regression, shallow neural networks, and deep
neural networks. Each represents a family of functions that map input to output, where
the particular member of the family is determined by the model parameters ϕ. When
we train these models, we seek the parameters that produce the best possible mapping
from input to output for the task we are considering. This chapter defines what is meant by the "best possible" mapping.
That denition requires a training dataset {x
i
, y
i
} of input/output pairs. A loss
function or cost function L[ϕ] returns a single number that describes the mismatch
between the model predictions f[x
i
, ϕ] and their corresponding ground-truth outputs y
i
.
During training, we seek parameter values ϕ that minimize the loss and hence map the
training inputs to the outputs as closely as possible. We saw one example of a loss
function in chapter 2; the least squares loss function is suitable for univariate regression
problems for which the target is a real number y R. It computes the sum of the squares
Appendix A
Number sets
of the deviations between the model predictions f[x
i
, ϕ] and the true values y
i
.
This chapter provides a framework that both justifies the choice of the least squares criterion for real-valued outputs and allows us to build loss functions for other prediction types. We consider binary classification, where the prediction y ∈ {0, 1} is one of two categories, multiclass classification, where the prediction y ∈ {1, 2, . . . , K} is one of K categories, and more complex cases. In the following two chapters, we address model training, where the goal is to find the parameter values that minimize these loss functions.
5.1 Maximum likelihood
In this section, we develop a recipe for constructing loss functions. Consider a model f[x, ϕ] with parameters ϕ that computes an output from input x. Until now, we have implied that the model directly computes a prediction y. We now shift perspective and consider the model as computing a conditional probability distribution Pr(y|x) over possible outputs y given input x (see Appendix C.1.3 for conditional probability). The loss encourages each training output y_i to have a high probability under the distribution Pr(y_i|x_i) computed from the corresponding input x_i (figure 5.1).
Figure 5.1 Predicting distributions over outputs. a) Regression task, where the goal is to predict a real-valued output y from the input x based on training data {x_i, y_i} (orange points). For each input value x, the machine learning model predicts a distribution Pr(y|x) over the output y ∈ ℝ (cyan curves show distributions for x = 2.0 and x = 7.0). The loss function aims to maximize the probability of the observed training outputs y_i under the distribution predicted from the corresponding inputs x_i. b) To predict discrete classes y ∈ {1, 2, 3, 4} in a classification task, we use a discrete probability distribution, so the model predicts a different histogram over the four possible values of y_i for each value of x_i. c) To predict counts y ∈ {0, 1, 2, . . .} and d) direction y ∈ (−π, π], we use distributions defined over positive integers and circular domains, respectively.
5.1.1 Computing a distribution over outputs
This raises the question of exactly how a model f[x, ϕ] can be adapted to compute a probability distribution. The solution is simple. First, we choose a parametric distribution Pr(y|θ) defined on the output domain y. Then we use the network to compute one or more of the parameters θ of this distribution.

For example, suppose the prediction domain is the set of real numbers, so y ∈ ℝ. Here, we might choose the univariate normal distribution, which is defined on ℝ. This distribution is defined by the mean µ and variance σ², so θ = {µ, σ²}. The machine learning model might predict the mean µ, and the variance σ² could be treated as an unknown constant.
5.1.2 Maximum likelihood criterion
The model now computes different distribution parameters θ_i = f[x_i, ϕ] for each training input x_i. Each observed training output y_i should have high probability under its corresponding distribution Pr(y_i|θ_i). Hence, we choose the model parameters ϕ so that they maximize the combined probability across all I training examples:

$$\hat{\phi} = \underset{\phi}{\operatorname{argmax}}\left[\prod_{i=1}^{I} Pr(y_i \mid x_i)\right] = \underset{\phi}{\operatorname{argmax}}\left[\prod_{i=1}^{I} Pr(y_i \mid \theta_i)\right] = \underset{\phi}{\operatorname{argmax}}\left[\prod_{i=1}^{I} Pr(y_i \mid f[x_i, \phi])\right]. \tag{5.1}$$

The combined probability term is the likelihood of the parameters, and hence equation 5.1 is known as the maximum likelihood criterion.¹
Here we are implicitly making two assumptions. First, we assume that the data are identically distributed (the form of the probability distribution over the outputs y_i is the same for each data point). Second, we assume that the conditional distributions Pr(y_i|x_i) of the output given the input are independent (see Appendix C.1.5), so the total likelihood of the training data decomposes as:

$$Pr(y_1, y_2, \ldots, y_I \mid x_1, x_2, \ldots, x_I) = \prod_{i=1}^{I} Pr(y_i \mid x_i). \tag{5.2}$$

In other words, we assume the data are independent and identically distributed (i.i.d.).

¹ A conditional probability Pr(z|ψ) can be considered in two ways. As a function of z, it is a probability distribution that sums to one. As a function of ψ, it is known as a likelihood and does not generally sum to one.
Figure 5.2 The log transform. a) The log function is monotonically increasing. If z > z′, then log[z] > log[z′]. It follows that the maximum of any function g[z] will be at the same position as the maximum of log[g[z]]. b) A function g[z]. c) The logarithm of this function log[g[z]]. All positions on g[z] with a positive slope retain a positive slope after the log transform, and those with a negative slope retain a negative slope. The position of the maximum remains the same.
5.1.3 Maximizing log-likelihood
The maximum likelihood criterion (equation 5.1) is not very practical. Each term Pr(y_i|f[x_i, ϕ]) can be small, so the product of many of these terms can be tiny. It may be difficult to represent this quantity with finite precision arithmetic. Fortunately, we can equivalently maximize the logarithm of the likelihood:

$$\hat{\phi} = \underset{\phi}{\operatorname{argmax}}\left[\prod_{i=1}^{I} Pr(y_i \mid f[x_i, \phi])\right] = \underset{\phi}{\operatorname{argmax}}\left[\log\left[\prod_{i=1}^{I} Pr(y_i \mid f[x_i, \phi])\right]\right] = \underset{\phi}{\operatorname{argmax}}\left[\sum_{i=1}^{I} \log\Bigl[Pr(y_i \mid f[x_i, \phi])\Bigr]\right]. \tag{5.3}$$

This log-likelihood criterion is equivalent because the logarithm is a monotonically increasing function: if z > z′, then log[z] > log[z′] and vice versa (figure 5.2). It follows that when we change the model parameters ϕ to improve the log-likelihood criterion, we also improve the original maximum likelihood criterion. It also follows that the overall maxima of the two criteria must be in the same place, so the best model parameters ϕ̂ are the same in both cases. However, the log-likelihood criterion has the practical advantage of using a sum of terms, not a product, so representing it with finite precision isn't problematic.
5.1.4 Minimizing negative log-likelihood
Finally, we note that, by convention, model fitting problems are framed in terms of minimizing a loss. To convert the maximum log-likelihood criterion to a minimization problem, we multiply by minus one, which gives us the negative log-likelihood criterion:

$$\hat{\phi} = \underset{\phi}{\operatorname{argmin}}\left[-\sum_{i=1}^{I} \log\Bigl[Pr(y_i \mid f[x_i, \phi])\Bigr]\right] = \underset{\phi}{\operatorname{argmin}}\Bigl[L[\phi]\Bigr], \tag{5.4}$$

which is what forms the final loss function L[ϕ].
5.1.5 Inference
The network no longer directly predicts the outputs y but instead determines a probability distribution over y. When we perform inference, we often want a point estimate rather than a distribution, so we return the maximum of the distribution:

$$\hat{y} = \underset{y}{\operatorname{argmax}}\Bigl[Pr(y \mid f[x, \hat{\phi}])\Bigr]. \tag{5.5}$$

It is usually possible to find an expression for this in terms of the distribution parameters θ predicted by the model. For example, in the univariate normal distribution, the maximum occurs at the mean µ.
5.2 Recipe for constructing loss functions
The recipe for constructing loss functions for training data {x_i, y_i} using the maximum likelihood approach is hence:

1. Choose a suitable probability distribution Pr(y|θ) defined over the domain of the predictions y with distribution parameters θ.
2. Set the machine learning model f[x, ϕ] to predict one or more of these parameters, so θ = f[x, ϕ] and Pr(y|θ) = Pr(y|f[x, ϕ]).
3. To train the model, find the network parameters ϕ̂ that minimize the negative log-likelihood loss function over the training dataset pairs {x_i, y_i}:

$$\hat{\phi} = \underset{\phi}{\operatorname{argmin}}\Bigl[L[\phi]\Bigr] = \underset{\phi}{\operatorname{argmin}}\left[-\sum_{i=1}^{I} \log\Bigl[Pr(y_i \mid f[x_i, \phi])\Bigr]\right]. \tag{5.6}$$

4. To perform inference for a new test example x, return either the full distribution Pr(y|f[x, ϕ̂]) or the maximum of this distribution.

We devote most of the rest of this chapter to constructing loss functions for common prediction types using this recipe.
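As a concrete sketch of the four steps, the toy example below fits a two-parameter linear model by minimizing the negative log-likelihood of a fixed-variance normal distribution. The data, the linear model, and the use of plain gradient descent are illustrative assumptions made here, not choices from the text:

```python
import numpy as np

# Toy training data: y is roughly linear in x (invented for illustration).
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 10.0, size=50)
y = 1.5 * x + 2.0 + rng.normal(0.0, 0.5, size=50)

# Step 1: choose a distribution over y -- univariate normal with fixed sigma.
sigma = 0.5
# Step 2: the model predicts its mean: f[x, phi] = phi0 + phi1 * x.
# Step 3: minimize the negative log-likelihood (here by gradient descent).
phi0, phi1 = 0.0, 0.0
lr = 1e-4
for _ in range(20000):
    mu = phi0 + phi1 * x
    grad_mu = (mu - y) / sigma**2     # derivative of the NLL w.r.t. each mean
    phi0 -= lr * np.sum(grad_mu)
    phi1 -= lr * np.sum(grad_mu * x)

# Step 4: inference returns the maximum of Pr(y|x), i.e., the predicted mean.
print(phi0, phi1)  # close to the generating values 2.0 and 1.5
```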
Figure 5.3 The univariate normal distribution (also known as the Gaussian distribution) is defined on the real line z ∈ ℝ and has parameters µ and σ². The mean µ determines the position of the peak. The positive root of the variance σ² (the standard deviation) determines the width of the distribution. Since the total probability density sums to one, the peak becomes higher as the variance decreases and the distribution becomes narrower.
5.3 Example 1: univariate regression
We start by considering univariate regression models. Here the goal is to predict a single scalar output y ∈ ℝ from input x using a model f[x, ϕ] with parameters ϕ. Following the recipe, we choose a probability distribution over the output domain y. We select the univariate normal (figure 5.3), which is defined over y ∈ ℝ. This distribution has two parameters (mean µ and variance σ²) and has a probability density function:

$$Pr(y \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(y-\mu)^2}{2\sigma^2}\right]. \tag{5.7}$$
Second, we set the machine learning model f[x, ϕ] to compute one or more of the parameters of this distribution. Here, we just compute the mean so µ = f[x, ϕ]:

$$Pr(y \mid f[x, \phi], \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(y-f[x,\phi])^2}{2\sigma^2}\right]. \tag{5.8}$$
We aim to nd the parameters ϕ that make the training data {x
i
, y
i
} most probable
under this distribution (gure 5.4). To accomplish this, we choose a loss function L[ϕ]
based on the negative log-likelihood:
L[ϕ] =
I
X
i=1
log
P r(y
i
|f[x
i
, ϕ], σ
2
)
=
I
X
i=1
log
1
2πσ
2
exp
(y
i
f[x
i
, ϕ])
2
2σ
2

. (5.9)
When we train the model, we seek parameters
ˆ
ϕ that minimize this loss.
5.3.1 Least squares loss function
Now let’s perform some algebraic manipulations on the loss function. We seek:
ˆ
ϕ = argmin
ϕ
"
I
X
i=1
log
1
2πσ
2
exp
(y
i
f[x
i
, ϕ])
2
2σ
2

#
= argmin
ϕ
"
I
X
i=1
log
1
2πσ
2
(y
i
f[x
i
, ϕ])
2
2σ
2
#
= argmin
ϕ
"
I
X
i=1
(y
i
f[x
i
, ϕ])
2
2σ
2
#
= argmin
ϕ
"
I
X
i=1
(y
i
f[x
i
, ϕ])
2
#
, (5.10)
where we have removed the rst term between the second and third lines because it does
not depend on ϕ. We have removed the denominator between the third and fourth lines,
as this is just a constant scaling factor that does not aect the position of the minimum.
The result of these manipulations is the least squares loss function that we originally
introduced when we discussed linear regression in chapter 2:
L[ϕ] =
I
X
i=1
y
i
f[x
i
, ϕ]
2
. (5.11)
We see that the least squares loss function follows naturally from the assumptions that the
Notebook 5.1
Least squares
loss
predictions are (i) independent and (ii) drawn from a normal distribution with mean µ =
f[x
i
, ϕ] (gure 5.4).
5.3.2 Inference
The network no longer directly predicts y but instead predicts the mean µ = f[x, ϕ] of the normal distribution over y. When we perform inference, we usually want a single "best" point estimate ŷ, so we take the maximum of the predicted distribution:

$$\hat{y} = \underset{y}{\operatorname{argmax}}\Bigl[Pr(y \mid f[x, \hat{\phi}], \sigma^2)\Bigr]. \tag{5.12}$$

For the univariate normal, the maximum position is determined by the mean parameter µ (figure 5.3). This is precisely what the model computed, so ŷ = f[x, ϕ̂].
5.3.3 Estimating variance
To formulate the least squares loss function, we assumed that the network predicted the mean of a normal distribution. The final expression in equation 5.11 (perhaps surprisingly) does not depend on the variance σ². However, there is nothing to stop us from treating σ² as a parameter of the model and minimizing equation 5.9 with respect to both the model parameters ϕ and the distribution variance σ²:

$$\hat{\phi}, \hat{\sigma}^2 = \underset{\phi, \sigma^2}{\operatorname{argmin}}\left[-\sum_{i=1}^{I} \log\left[\frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(y_i-f[x_i,\phi])^2}{2\sigma^2}\right]\right]\right]. \tag{5.13}$$

In inference, the model predicts the mean µ = f[x, ϕ̂] from the input, and we learned the variance σ̂² during the training process. The former is the best prediction. The latter tells us about the uncertainty of the prediction.

Figure 5.4 Equivalence of least squares and maximum likelihood loss for the normal distribution. a) Consider the linear model from figure 2.2. The least squares criterion minimizes the sum of the squares of the deviations (dashed lines) between the model prediction f[x_i, ϕ] (green line) and the true output values y_i (orange points). Here the fit is good, so these deviations are small (e.g., for the two highlighted points). b) For these parameters, the fit is bad, and the squared deviations are large. c) The least squares criterion follows from the assumption that the model predicts the mean of a normal distribution over the outputs and that we maximize the probability. For the first case, the model fits well, so the probability Pr(y_i|x_i) of the data (horizontal orange dashed lines) is large (and the negative log probability is small). d) For the second case, the model fits badly, so the probability is small and the negative log probability is large.
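Minimizing equation 5.13 with respect to σ² (holding ϕ fixed) yields the mean squared residual. A quick numerical check, using invented residuals and a grid search over candidate variances:

```python
import numpy as np

# Residuals y_i - f[x_i, phi] of some fitted model (invented for illustration).
residuals = np.array([0.3, -0.1, 0.4, -0.5, 0.2, 0.0, -0.3, 0.1])

def nll(sigma2):
    """Equation 5.13's objective as a function of sigma^2, with phi held fixed."""
    return np.sum(0.5 * np.log(2 * np.pi * sigma2)
                  + residuals ** 2 / (2 * sigma2))

# Evaluate the loss on a fine grid of candidate variances.
grid = np.linspace(0.01, 1.0, 10000)
best = grid[np.argmin([nll(s2) for s2 in grid])]

# The minimizing variance equals the mean squared residual.
print(np.isclose(best, np.mean(residuals ** 2), atol=1e-3))  # True
```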
5.3.4 Heteroscedastic regression
The model above assumes that the variance of the data is constant everywhere. However, this might be unrealistic. When the uncertainty of the model varies as a function of the input data, we refer to this as heteroscedastic (as opposed to homoscedastic, where the uncertainty is constant).

A simple way to model this is to train a neural network f[x, ϕ] that computes both the mean and the variance. For example, consider a shallow network with two outputs. We denote the first output as f_1[x, ϕ] and use this to predict the mean, and we denote the second output as f_2[x, ϕ] and use it to predict the variance.

There is one complication; the variance must be positive, but we can't guarantee that the network will always produce a positive output. To ensure that the computed variance is positive, we pass the second network output through a function that maps an arbitrary value to a positive one. A suitable choice is the squaring function, giving:

$$\mu = f_1[x, \phi] \qquad \sigma^2 = f_2[x, \phi]^2, \tag{5.14}$$

which results in the loss function:

$$\hat{\phi} = \underset{\phi}{\operatorname{argmin}}\left[-\sum_{i=1}^{I} \left(\log\left[\frac{1}{\sqrt{2\pi f_2[x_i,\phi]^2}}\right] - \frac{(y_i - f_1[x_i,\phi])^2}{2 f_2[x_i,\phi]^2}\right)\right]. \tag{5.15}$$

Homoscedastic and heteroscedastic models are compared in figure 5.5.
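A direct transcription of equation 5.15 into code might look as follows. The toy numbers are invented, and this is a sketch of the loss alone, not a full training loop:

```python
import numpy as np

def heteroscedastic_nll(y, f1, f2):
    """Negative log-likelihood of equation 5.15. f1 predicts the mean;
    f2 is squared to guarantee a positive variance."""
    var = f2 ** 2
    return np.sum(0.5 * np.log(2 * np.pi * var) + (y - f1) ** 2 / (2 * var))

# Toy check: predictions with the right mean but overconfident (tiny) variance
# are penalized more heavily than reasonably calibrated ones.
y = np.array([0.0, 1.0, 2.0])
f1 = np.array([0.1, 0.9, 2.2])
nll_calibrated = heteroscedastic_nll(y, f1, f2=np.array([0.2, 0.2, 0.3]))
nll_overconfident = heteroscedastic_nll(y, f1, f2=np.array([0.01, 0.01, 0.01]))
print(nll_calibrated < nll_overconfident)  # True
```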
5.4 Example 2: binary classication
In binary classication, the goal is to assign the data x to one of two discrete classes y
{0, 1}. In this context, we refer to y as a label. Examples of binary classication include
(i) predicting whether a restaurant review is positive (y = 1) or negative (y = 0) from
text data x and (ii) predicting whether a tumor is present (y = 1) or absent (y = 0 )
from an MRI scan x.
Figure 5.5 Homoscedastic vs. heteroscedastic regression. a) A shallow neural network for homoscedastic regression predicts just the mean µ of the output distribution from the input x. b) The result is that while the mean (blue line) is a piecewise linear function of the input x, the variance is constant everywhere (arrows and gray region show ±2 standard deviations). c) A shallow neural network for heteroscedastic regression also predicts the variance σ² (or, more precisely, computes its square root, which we then square). d) The standard deviation now also becomes a piecewise linear function of the input x.
Figure 5.6 Bernoulli distribution. The Bernoulli distribution is defined on the domain z ∈ {0, 1} and has a single parameter λ that denotes the probability of observing z = 1. It follows that the probability of observing z = 0 is 1 − λ.
Figure 5.7 Logistic sigmoid function. This function maps the real line z ∈ ℝ to numbers between zero and one, so sig[z] ∈ [0, 1]. An input of 0 is mapped to 0.5. Negative inputs are mapped to numbers below 0.5, and positive inputs to numbers above 0.5.
Once again, we follow the recipe from section 5.2 to construct the loss function. First, we choose a probability distribution over the output space y ∈ {0, 1}. A suitable choice is the Bernoulli distribution, which is defined on the domain {0, 1}. This has a single parameter λ ∈ [0, 1] that represents the probability that y takes the value one (figure 5.6):

$$Pr(y \mid \lambda) = \begin{cases} 1 - \lambda & y = 0 \\ \lambda & y = 1 \end{cases}, \tag{5.16}$$

which can equivalently be written as:

$$Pr(y \mid \lambda) = (1 - \lambda)^{1-y} \cdot \lambda^{y}. \tag{5.17}$$

Second, we set the machine learning model f[x, ϕ] to predict the single distribution parameter λ. However, λ can only take values in the range [0, 1], and we cannot guarantee that the network output will lie in this range. Consequently, we pass the network output through a function that maps the real numbers ℝ to [0, 1]. A suitable function is the logistic sigmoid (figure 5.7; see also problem 5.1):

$$\text{sig}[z] = \frac{1}{1 + \exp[-z]}. \tag{5.18}$$

Hence, we predict the distribution parameter as λ = sig[f[x, ϕ]]. The likelihood is now:

$$Pr(y \mid x) = (1 - \text{sig}[f[x, \phi]])^{1-y} \cdot \text{sig}[f[x, \phi]]^{y}. \tag{5.19}$$
This is depicted in gure 5.8 for a shallow neural network model. The loss function is
the negative log-likelihood of the training set:
L[ϕ] =
I
X
i
=1
(1 y
i
) log
h
1 sig[f[x
i
, ϕ]]
i
y
i
log
h
sig[f[x
i
, ϕ]]
i
. (5.20)
For reasons to be explained in section 5.7, this is known as the binary cross-entropy loss.
The transformed model output sig[f[x, ϕ]] predicts the parameter λ of the Bernoulli
Notebook 5.2
Binary
cross-entropy loss
distribution. This represents the probability that y = 1, and it follows that 1 λ
represents the probability that y = 0. When we perform inference, we may want a point
Problem 5.2
estimate of y, so we set y = 1 if λ > 0.5 and y = 0 otherwise.
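Equations 5.18 and 5.20 can be transcribed directly (a sketch with invented labels and raw outputs; practical implementations usually prefer numerically stable formulations that work on the raw output z):

```python
import numpy as np

def sig(z):
    """Logistic sigmoid (equation 5.18)."""
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y, z):
    """Negative log-likelihood of the Bernoulli model (equation 5.20),
    where z = f[x, phi] is the raw network output."""
    lam = sig(z)
    return np.sum(-(1 - y) * np.log(1 - lam) - y * np.log(lam))

y = np.array([1, 0, 1])
# The loss is small when the model assigns high probability to the true labels...
small = binary_cross_entropy(y, z=np.array([4.0, -4.0, 4.0]))
# ...and large when it is confidently wrong.
large = binary_cross_entropy(y, z=np.array([-4.0, 4.0, -4.0]))
print(small, large)  # ≈ 0.054 and ≈ 12.05
```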
Figure 5.8 Binary classication model. a) The network output is a piecewise
linear function that can take arbitrary real values. b) This is transformed by the
logistic sigmoid function, which compresses these values to the range [0, 1]. c)
The transformed output predicts the probability λ that y = 1 (solid line). The
probability that y = 0 is hence 1 λ (dashed line). For any xed x (vertical
slice), we retrieve the two values of a Bernoulli distribution similar to that in
gure 5.6. The loss function favors model parameters that produce large values
of λ at positions x
i
that are associated with positive examples y
i
= 1 and small
values of λ at positions associated with negative examples y
i
= 0.
Figure 5.9 Categorical distribution. The
categorical distribution assigns probabil-
ities to K >2 categories, with associated
probabilities λ
1
, λ
2
, . . . , λ
K
. Here, there
are ve categories, so K = 5. To ensure
that this is a valid probability distribu-
tion, each parameter λ
k
must lie in the
range [0, 1], and all K parameters must
sum to one.
5.5 Example 3: multiclass classification

The goal of multiclass classification is to assign an input data example x to one of K > 2
classes, so y ∈ {1, 2, . . . , K}. Real-world examples include (i) predicting which of K = 10
digits y is present in an image x of a handwritten number and (ii) predicting which of K
possible words y follows an incomplete sentence x.

We once more follow the recipe from section 5.2. We first choose a distribution
over the prediction space y. In this case, we have y ∈ {1, 2, . . . , K}, so we choose
the categorical distribution (figure 5.9), which is defined on this domain. This has K
parameters λ_1, λ_2, . . . , λ_K, which determine the probability of each category:
Draft: please send errata to udlbookmail@gmail.com.
Figure 5.10 Multiclass classification for K = 3 classes. a) The network has three
piecewise linear outputs, which can take arbitrary values. b) After the softmax
function, these outputs are constrained to be non-negative and sum to one. Hence,
for a given input x, we compute valid parameters for the categorical distribution:
any vertical slice of this plot produces three values that sum to one and would form
the heights of the bars in a categorical distribution similar to figure 5.9.

Pr(y = k) = λ_k.                                                  (5.21)
The parameters are constrained to take values between zero and one, and they must
collectively sum to one to ensure a valid probability distribution.

Then we use a network f[x, ϕ] with K outputs to compute these K parameters from
the input x. Unfortunately, the network outputs will not necessarily obey the afore-
mentioned constraints. Consequently, we pass the K outputs of the network through a
function that ensures these constraints are respected. A suitable choice is the softmax
function (figure 5.10). This takes an arbitrary vector of length K and returns a vector
of the same length but where the elements are now in the range [0, 1] and sum to one.
The k-th output of the softmax function is:

softmax_k[z] = exp[z_k] / ∑_{k′=1}^{K} exp[z_{k′}],               (5.22)

where the exponential functions ensure positivity, and the sum in the denominator
ensures that the K numbers sum to one.

Appendix B.1.3
Exponential function
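A direct implementation of equation 5.22 can overflow for large inputs, so practical implementations usually subtract the maximum element first; this constant shift cancels in the ratio and leaves the result unchanged. A minimal NumPy sketch (our own, not from the book):

```python
import numpy as np

def softmax(z):
    # Softmax of equation 5.22, shifted by max(z) for numerical stability.
    # Subtracting the same constant from every z_k cancels in the ratio.
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

# The output is non-negative and sums to one, as a categorical
# distribution's parameters must.
probs = softmax(np.array([2.0, 1.0, 0.1]))
```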
The likelihood that input x has label y = k (figure 5.10) is hence:

Pr(y = k|x) = softmax_k[f[x, ϕ]].                                 (5.23)

The loss function is the negative log-likelihood of the training data:
L[ϕ] = −∑_{i=1}^{I} log[softmax_{y_i}[f[x_i, ϕ]]]
     = −∑_{i=1}^{I} ( f_{y_i}[x_i, ϕ] − log[ ∑_{k=1}^{K} exp[f_k[x_i, ϕ]] ] ),   (5.24)

where f_k[x, ϕ] denotes the k-th output of the neural network. For reasons that will be
explained in section 5.7, this is known as the multiclass cross-entropy loss.

Notebook 5.3
Multiclass cross-entropy loss

The transformed model output represents a categorical distribution over possible
classes y ∈ {1, 2, . . . , K}. For a point estimate, we take the most probable category
ŷ = argmax_k [Pr(y = k|f[x, ϕ̂])]. This corresponds to whichever curve is highest for that
value of x in figure 5.10.
5.5.1 Predicting other data types

In this chapter, we have focused on regression and classification because these problems
are widespread. However, to make different types of predictions, we simply choose an
appropriate distribution over that domain and apply the recipe in section 5.2. Figure 5.11
enumerates a series of probability distributions and their prediction domains. Some of
these are explored in the problems at the end of the chapter.

Problems 5.3–5.6
5.6 Multiple outputs

Often, we wish to make more than one prediction with the same model, so the target
output y is a vector. For example, we might want to predict a molecule's melting
and boiling point (a multivariate regression problem, figure 1.2b) or the object class at
every point in an image (a multivariate classification problem, figure 1.4a). While it
is possible to define multivariate probability distributions and use a neural network to
model their parameters as a function of the input, it is more usual to treat each prediction
as independent.

Appendix C.1.5
Independence

Independence implies that we treat the probability Pr(y|f[x, ϕ]) as a product of
univariate terms for each element y_d ∈ y:

Pr(y|f[x, ϕ]) = ∏_d Pr(y_d|f_d[x, ϕ]),                            (5.25)

where f_d[x, ϕ] is the d-th set of network outputs, which describe the parameters of the
distribution over y_d. For example, to predict multiple continuous variables y_d ∈ ℝ, we
use a normal distribution for each y_d, and the network outputs f_d[x, ϕ] predict the means
of these distributions. To predict multiple discrete variables y_d ∈ {1, 2, . . . , K}, we use a
categorical distribution for each y_d. Here, each set of network outputs f_d[x, ϕ] predicts
the K values that contribute to the categorical distribution for y_d.
Data type                              Domain                     Distribution               Use
univariate, continuous, unbounded      y ∈ ℝ                      normal                     univariate regression
univariate, continuous, unbounded      y ∈ ℝ                      Laplace or t-distribution  robust regression
univariate, continuous, unbounded      y ∈ ℝ                      mixture of Gaussians       multimodal regression
univariate, continuous, bounded below  y ∈ ℝ⁺                     exponential or gamma       predicting magnitude
univariate, continuous, bounded        y ∈ [0, 1]                 beta                       predicting proportions
multivariate, continuous, unbounded    y ∈ ℝᴷ                     multivariate normal        multivariate regression
univariate, continuous, circular       y ∈ (−π, π]                von Mises                  predicting direction
univariate, discrete, binary           y ∈ {0, 1}                 Bernoulli                  binary classification
univariate, discrete, bounded          y ∈ {1, 2, . . . , K}      categorical                multiclass classification
univariate, discrete, bounded below    y ∈ {0, 1, 2, . . .}       Poisson                    predicting event counts
multivariate, discrete, permutation    y ∈ Perm[1, 2, . . . , K]  Plackett-Luce              ranking

Figure 5.11 Distributions for loss functions for different prediction types.
When we minimize the negative log probability, this product becomes a sum of terms:

L[ϕ] = −∑_{i=1}^{I} log[Pr(y_i|f[x_i, ϕ])] = −∑_{i=1}^{I} ∑_d log[Pr(y_{id}|f_d[x_i, ϕ])],   (5.26)

where y_{id} is the d-th output from the i-th training example.

To make two or more prediction types simultaneously, we similarly assume the errors
in each are independent. For example, to predict wind direction and strength, we might
choose the von Mises distribution (defined on circular domains) for the direction and
the exponential distribution (defined on positive real numbers) for the strength. The
independence assumption implies that the joint likelihood of the two predictions is the
product of individual likelihoods. These terms will become additive when we compute
the negative log-likelihood.

Problems 5.7–5.10
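As an illustration of equation 5.26 (our own sketch, not the book's code), the loss for several independent, normally distributed outputs with fixed variance reduces to a sum of per-dimension squared errors plus a constant:

```python
import numpy as np

def multivariate_loss(mu, y, sigma2=1.0):
    # Negative log-likelihood for independent normal outputs (equation 5.26).
    # mu: (I, D) predicted means f_d[x_i, phi]; y: (I, D) targets;
    # sigma2: fixed, shared variance (an assumption of this sketch).
    log_probs = -0.5 * np.log(2 * np.pi * sigma2) - (y - mu) ** 2 / (2 * sigma2)
    return -np.sum(log_probs)

mu = np.zeros((2, 3))
y = np.array([[1.0, 0.0, -1.0],
              [0.5, 0.5, 0.5]])
loss = multivariate_loss(mu, y)
```

With σ² fixed, this differs from the least squares loss only by a constant and a scale factor, so both have the same minimizer.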
Figure 5.12 Cross-entropy method. a) Empirical distribution of training samples
(arrows denote Dirac delta functions). b) Model distribution (a normal distribution
with parameters θ = {µ, σ²}). In the cross-entropy approach, we minimize
the distance (KL divergence) between these two distributions as a function of the
model parameters θ.
5.7 Cross-entropy loss

In this chapter, we developed loss functions that minimize negative log-likelihood. How-
ever, the term cross-entropy loss is also commonplace. In this section, we describe the
cross-entropy loss and show that it is equivalent to using negative log-likelihood.

The cross-entropy loss is based on the idea of finding parameters θ that minimize the
distance between the empirical distribution q(y) of the observed data y and a model dis-
tribution Pr(y|θ) (figure 5.12). The distance between two probability distributions q(z)
and p(z) can be evaluated using the Kullback-Leibler (KL) divergence:

Appendix C.5.1
KL Divergence

D_KL[q||p] = ∫_{−∞}^{∞} q(z) log[q(z)] dz − ∫_{−∞}^{∞} q(z) log[p(z)] dz.   (5.27)
Now consider that we observe an empirical data distribution at points {y_i}_{i=1}^{I}. We
can describe this as a weighted sum of point masses:

q(y) = (1/I) ∑_{i=1}^{I} δ[y − y_i],                              (5.28)

where δ[•] is the Dirac delta function. We want to minimize the KL divergence between
the model distribution Pr(y|θ) and this empirical distribution:

Appendix B.1.3
Dirac delta function

θ̂ = argmin_θ [ ∫_{−∞}^{∞} q(y) log[q(y)] dy − ∫_{−∞}^{∞} q(y) log[Pr(y|θ)] dy ]
  = argmin_θ [ − ∫_{−∞}^{∞} q(y) log[Pr(y|θ)] dy ],               (5.29)
where the rst term disappears, as it has no dependence on θ. The remaining second
term is known as the cross-entropy. It can be interpreted as the amount of uncertainty
that remains in one distribution after taking into account what we already know from
the other. Now, we substitute in the denition of q(y) from equation 5.28:
ˆ
θ = argmin
θ
"
Z
−∞
1
I
I
X
i=1
δ[y y
i
]
!
log
P r(y|θ)
dy
#
= argmin
θ
"
1
I
I
X
i=1
log
P r(y
i
|θ)
#
= argmin
θ
"
I
X
i=1
log
P r(y
i
|θ)
#
. (5.30)
The product of the two terms in the rst line corresponds to pointwise multiplying the
point masses in gure 5.12a with the logarithm of the distribution in gure 5.12b. We
are left with a nite set of weighted probability masses centered on the data points. In
the last line, we have eliminated the constant scaling factor 1/I, as this does not aect
the position of the minimum.
In machine learning, the distribution parameters θ are computed by the model f[x_i, ϕ],
so we have:

ϕ̂ = argmin_ϕ [ − ∑_{i=1}^{I} log[Pr(y_i|f[x_i, ϕ])] ].           (5.31)

This is precisely the negative log-likelihood criterion from the recipe in section 5.2.
It follows that the negative log-likelihood criterion (from maximizing the data likeli-
hood) and the cross-entropy criterion (from minimizing the distance between the model
and empirical data distributions) are equivalent.
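A small numerical sanity check of this equivalence (our own, assuming a normal model with fixed variance): the cross-entropy against the empirical distribution of equation 5.28 is just the negative log-likelihood scaled by 1/I, so both criteria share the same minimizer.

```python
import numpy as np

def neg_log_likelihood(y, mu, sigma2):
    # -sum_i log Pr(y_i | mu, sigma2) for a normal model.
    return -np.sum(-0.5 * np.log(2 * np.pi * sigma2)
                   - (y - mu) ** 2 / (2 * sigma2))

y = np.array([0.2, 1.1, 0.7, -0.3])
mus = np.linspace(-2, 2, 401)
nll = [neg_log_likelihood(y, m, 1.0) for m in mus]
cross_entropy = [v / len(y) for v in nll]   # scaling by 1/I does not move the minimum

best_mu = mus[np.argmin(nll)]               # same argmin under both criteria
```

For this model, the shared minimizer is the sample mean, consistent with the least squares result of section 5.3.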
5.8 Summary

We previously considered neural networks as directly predicting outputs y from data x.
In this chapter, we shifted perspective to think about neural networks as computing the
parameters θ of probability distributions Pr(y|θ) over the output space. This led to a
principled approach to building loss functions. We selected model parameters ϕ that
maximized the likelihood of the observed data under these distributions. We saw that
this is equivalent to minimizing the negative log-likelihood.

The least squares criterion for regression is a natural consequence of this approach;
it follows from the assumption that y is normally distributed and that we are predicting
the mean. We also saw how the regression model could be (i) extended to estimate the
uncertainty over the prediction and (ii) extended to make that uncertainty dependent
on the input (the heteroscedastic model). We applied the same approach to both binary
and multiclass classification and derived loss functions for each. We discussed how to
tackle more complex data types and how to deal with multiple outputs. Finally, we
argued that cross-entropy is an equivalent way to think about fitting models.

In previous chapters, we developed neural network models. In this chapter, we de-
veloped loss functions for deciding how well a model describes the training data for a
given set of parameters. The next chapter considers model training, in which we aim to
find the model parameters that minimize this loss.
Notes
Losses based on the normal distribution: Nix & Weigend (1994) and Williams (1996)
investigated heteroscedastic nonlinear regression in which both the mean and the variance of
the output are functions of the input. In the context of unsupervised learning, Burda et al.
(2016) use a loss function based on a multivariate normal distribution with diagonal covariance,
and Dorta et al. (2018) use a loss function based on a normal distribution with full covariance.
Robust regression: Qi et al. (2020) investigate the properties of regression models that min-
imize mean absolute error rather than mean squared error. This loss function follows from
assuming a Laplace distribution over the outputs and estimates the median output for a given
input rather than the mean. Barron (2019) presents a loss function that parameterizes the de-
gree of robustness. When interpreted in a probabilistic context, it yields a family of univariate
probability distributions that includes the normal and Cauchy distributions as special cases.
Estimating quantiles: Sometimes, we may not want to estimate the mean or median in a
regression task but may instead want to predict a quantile. For example, this is useful for risk
models, where we want to know that the true value will be less than the predicted value 90%
of the time. This is known as quantile regression (Koenker & Hallock, 2001). This could be
done by tting a heteroscedastic regression model and then estimating the quantile based on
the predicted normal distribution. Alternatively, the quantiles can be estimated directly using
quantile loss (also known as pinball loss). In practice, this minimizes the absolute deviations
of the data from the model but weights the deviations in one direction more than the other.
Recent work has investigated simultaneously predicting multiple quantiles to get an idea of the
overall distribution shape (Rodrigues & Pereira, 2020).
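The quantile (pinball) loss described above can be sketched as follows (our own illustration, not code from the cited papers); deviations above and below the prediction are weighted asymmetrically by the target quantile τ:

```python
import numpy as np

def pinball_loss(pred, y, tau):
    # Quantile (pinball) loss: under-predictions are weighted by tau and
    # over-predictions by (1 - tau), so the minimizer is the tau-quantile.
    diff = y - pred
    return np.sum(np.where(diff >= 0, tau * diff, (tau - 1) * diff))

# The constant prediction minimizing this loss approximates the tau-quantile.
y = np.array([1.0, 2.0, 3.0, 4.0, 10.0])
grid = np.linspace(0, 10, 1001)
best = grid[np.argmin([pinball_loss(p, y, 0.9) for p in grid])]
```

With τ = 0.5, the weighting is symmetric and the loss reduces to (half) the absolute deviation, whose minimizer is the median.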
Class imbalance and focal loss: Lin et al. (2017c) address data imbalance in classication
problems. If the number of examples for some classes is much greater than for others, then the
standard maximum likelihood loss does not work well; the model may concentrate on becoming
more condent about well-classied examples from the dominant classes and classify less well-
represented classes poorly. Lin et al. (2017c) introduce focal loss, which adds a single extra
parameter that down-weights the eect of well-classied examples to improve performance.
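A hedged sketch of the binary form of focal loss as described by Lin et al. (2017c), with the focusing parameter written here as `gamma` (setting gamma = 0 recovers ordinary binary cross-entropy):

```python
import numpy as np

def focal_loss(f, y, gamma=2.0):
    # Binary focal loss: each cross-entropy term is scaled by (1 - p_t)^gamma,
    # where p_t is the predicted probability of the true class, so confident,
    # well-classified examples contribute little to the total loss.
    lam = 1.0 / (1.0 + np.exp(-f))          # sigmoid of the network output
    p_t = np.where(y == 1, lam, 1.0 - lam)  # probability of the correct label
    return -np.sum((1.0 - p_t) ** gamma * np.log(p_t))
```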
Learning to rank: Cao et al. (2007), Xia et al. (2008), and Chen et al. (2009) all used the
Plackett-Luce model in loss functions for learning to rank data. This is the listwise approach to
learning to rank as the model ingests an entire list of objects to be ranked at once. Alternative
approaches are the pointwise approach, in which the model ingests a single object, and the
pairwise approach, where the model ingests pairs of objects. Chen et al. (2009) summarize
different approaches for learning to rank.
Other data types: Fan et al. (2020) use a loss based on the beta distribution for predicting
values between zero and one. Jacobs et al. (1991) and Bishop (1994) investigated mixture
density networks for multimodal data. These model the output as a mixture of Gaussians
Figure 5.13 The von Mises distribution is defined over the circular domain (−π, π].
It has two parameters. The mean µ determines the position of the peak. The
concentration κ > 0 acts like the inverse of the variance. Hence 1/√κ is roughly
equivalent to the standard deviation in a normal distribution.
(see gure 5.14) that is conditional on the input. Prokudin et al. (2018) used the von Mises
distribution to predict direction (see gure 5.13). Fallah et al. (2009) constructed loss functions
for prediction counts using the Poisson distribution (see gure 5.15). Ng et al. (2017) used loss
functions based on the gamma distribution to predict duration.
Non-probabilistic approaches: It is not strictly necessary to adopt the probabilistic ap-
proach discussed in this chapter, but this has become the default in recent years; any loss func-
tion that aims to reduce the distance between the model output and the training outputs will
suce, and distance can be dened in any way that seems sensible. There are several well-known
non-probabilistic machine learning models for classication, including support vector machines
(Vapnik, 1995; Cristianini & Shawe-Taylor, 2000), which use hinge loss, and AdaBoost (Freund
& Schapire, 1997), which uses exponential loss.
Problems
Problem 5.1 Show that the logistic sigmoid function sig[z] becomes 0 as z → −∞, is 0.5
when z = 0, and becomes 1 when z → ∞, where:

sig[z] = 1 / (1 + exp[−z]).                                       (5.32)

Problem 5.2 The loss L for binary classification for a single training pair {x, y} is:

L = −(1 − y) log[1 − sig[f[x, ϕ]]] − y log[sig[f[x, ϕ]]],         (5.33)

where sig[•] is defined in equation 5.32. Plot this loss as a function of the transformed network
output sig[f[x, ϕ]] ∈ [0, 1] (i) when the training label y = 0 and (ii) when y = 1.
Problem 5.3 Suppose we want to build a model that predicts the direction y in radians of the
prevailing wind based on local measurements of barometric pressure x. A suitable distribution
over circular domains is the von Mises distribution (figure 5.13):

Pr(y|µ, κ) = exp[κ cos[y − µ]] / (2π · Bessel₀[κ]),               (5.34)
Figure 5.14 Multimodal data and mixture of Gaussians density. a) Example
training data where, for intermediate values of the input x, the corresponding
output y follows one of two paths. For example, at x = 0, the output y might
be roughly −2 or +3 but is unlikely to be between these values. b) The mixture
of Gaussians is a probability model suited to this kind of data. As the name
suggests, the model is a weighted sum (solid cyan curve) of two or more normal
distributions with different means and variances (here, two weighted distributions,
dashed blue and orange curves). When the means are far apart, this forms a
multimodal distribution. c) When the means are close, the mixture can model
unimodal but non-normal densities.

where µ is a measure of the mean direction and κ is a measure of concentration (i.e., the inverse
of the variance). The term Bessel₀[κ] is a modified Bessel function of the first kind of order 0.

Use the recipe from section 5.2 to develop a loss function for learning the parameter µ of a
model f[x, ϕ] to predict the most likely wind direction. Your solution should treat the concen-
tration κ as constant. How would you perform inference?
Problem 5.4 Sometimes, the outputs y for input x are multimodal (figure 5.14a); there is
more than one valid prediction for a given input. Here, we might use a weighted sum of normal
components as the distribution over the output. This is known as a mixture of Gaussians model.
For example, a mixture of two Gaussians has parameters θ = {λ, µ₁, σ₁², µ₂, σ₂²}:

Pr(y|λ, µ₁, µ₂, σ₁², σ₂²) = λ/√(2πσ₁²) · exp[−(y − µ₁)²/(2σ₁²)]
                           + (1 − λ)/√(2πσ₂²) · exp[−(y − µ₂)²/(2σ₂²)],   (5.35)

where λ ∈ [0, 1] controls the relative weight of the two components, which have means µ₁, µ₂
and variances σ₁², σ₂², respectively. This model can represent a distribution with two peaks
(figure 5.14b) or a distribution with one peak but a more complex shape (figure 5.14c).

Use the recipe from section 5.2 to construct a loss function for training a model f[x, ϕ] that takes
input x, has parameters ϕ, and predicts a mixture of two Gaussians. The loss should be based
on I training data pairs {x_i, y_i}. What problems do you foresee when performing inference?
Problem 5.5 Consider extending the model from problem 5.3 to predict the wind direction using
a mixture of two von Mises distributions. Write an expression for the likelihood Pr(y|θ) for
this model. How many outputs will the network need to produce?
Figure 5.15 Poisson distribution. This discrete distribution is defined over non-
negative integers z ∈ {0, 1, 2, . . .}. It has a single parameter λ ∈ ℝ⁺, which is
known as the rate and is the mean of the distribution. a–c) Poisson distributions
with rates of 1.4, 2.8, and 6.0, respectively.
Problem 5.6 Consider building a model to predict the number of pedestrians y ∈ {0, 1, 2, . . .}
that will pass a given point in the city in the next minute, based on data x that contains
information about the time of day, the longitude and latitude, and the type of neighborhood.
A suitable distribution for modeling counts is the Poisson distribution (figure 5.15). This has
a single parameter λ > 0 called the rate that represents the mean of the distribution. The
distribution has probability mass function:

Pr(y = k) = λᵏ exp[−λ] / k!.                                      (5.36)

Design a loss function for this model assuming we have access to I training pairs {x_i, y_i}.
Problem 5.7 Consider a multivariate regression problem where we predict ten outputs, so
y ∈ ℝ¹⁰, and model each with an independent normal distribution where the means µ_d are
predicted by the network, and variances σ² are constant. Write an expression for the likeli-
hood Pr(y|f[x, ϕ]). Show that minimizing the negative log-likelihood of this model is still
equivalent to minimizing a sum of squared terms if we don't estimate the variance σ².

Problem 5.8 Construct a loss function for making multivariate predictions y ∈ ℝ^{D_o} based
on independent normal distributions with different variances σ_d² for each dimension. Assume
a heteroscedastic model so that both the means µ_d and variances σ_d² vary as a function of the
data.

Problem 5.9 Consider a multivariate regression problem in which we predict the height of a
person in meters and their weight in kilos from data x. Here, the units take quite different
ranges. What problems do you see this causing? Propose two solutions to these problems.

Problem 5.10 Extend the model from problem 5.3 to predict both the wind direction and the
wind speed and define the associated loss function.
Chapter 6

Fitting models

Chapters 3 and 4 described shallow and deep neural networks. These represent families
of piecewise linear functions, where the parameters determine the particular function.
Chapter 5 introduced the loss: a single number representing the mismatch between
the network predictions and the ground truth for a training set.

The loss depends on the network parameters, and this chapter considers how to find
the parameter values that minimize this loss. This is known as learning the network's
parameters or simply as training or fitting the model. The process is to choose initial
parameter values and then iterate the following two steps: (i) compute the derivatives
(gradients) of the loss with respect to the parameters, and (ii) adjust the parameters
based on the gradients to decrease the loss. After many iterations, we hope to reach the
overall minimum of the loss function.

This chapter tackles the second of these steps; we consider algorithms that adjust
the parameters to decrease the loss. Chapter 7 discusses how to initialize the parameters
and compute the gradients for neural networks.
6.1 Gradient descent

To fit a model, we need a training set {x_i, y_i} of input/output pairs. We seek parame-
ters ϕ for the model f[x_i, ϕ] that map the inputs x_i to the outputs y_i as well as possible.
To this end, we define a loss function L[ϕ] that returns a single number that quanti-
fies the mismatch in this mapping. The goal of an optimization algorithm is to find
parameters ϕ̂ that minimize the loss:

ϕ̂ = argmin_ϕ [ L[ϕ] ].                                           (6.1)

There are many families of optimization algorithms, but the standard methods for train-
ing neural networks are iterative. These algorithms initialize the parameters heuristically
and then adjust them repeatedly in such a way that the loss decreases.
The simplest method in this class is gradient descent. This starts with initial param-
eters ϕ = [ϕ₀, ϕ₁, . . . , ϕ_N]ᵀ and iterates two steps:

Step 1. Compute the derivatives of the loss with respect to the parameters:

∂L/∂ϕ = [ ∂L/∂ϕ₀, ∂L/∂ϕ₁, . . . , ∂L/∂ϕ_N ]ᵀ.                    (6.2)

Step 2. Update the parameters according to the rule:

ϕ ← ϕ − α · ∂L/∂ϕ,                                                (6.3)

where the positive scalar α determines the magnitude of the change.

The first step computes the gradient of the loss function at the current position. This
determines the uphill direction of the loss function. The second step moves a small
distance α downhill (hence the negative sign). The parameter α may be fixed (in which
case, we call it a learning rate), or we may perform a line search where we try several
values of α to find the one that most decreases the loss.

Notebook 6.1
Line search

At the minimum of the loss function, the surface must be flat (or we could improve
further by going downhill). Hence, the gradient will be zero, and the parameters will stop
changing. In practice, we monitor the gradient magnitude and terminate the algorithm
when it becomes too small.
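The two steps above can be sketched as a short loop (a minimal illustration of the algorithm, assuming the gradient is available as a closed-form function):

```python
import numpy as np

def gradient_descent(grad_fn, phi0, alpha=0.1, tol=1e-6, max_iters=10_000):
    # Iterate: compute the gradient (step 1), move a small distance downhill
    # (step 2), and terminate when the gradient magnitude becomes small.
    phi = np.array(phi0, dtype=float)
    for _ in range(max_iters):
        g = grad_fn(phi)
        if np.linalg.norm(g) < tol:
            break
        phi = phi - alpha * g
    return phi

# Example: minimize L[phi] = (phi_0 - 3)^2 + (phi_1 + 1)^2, a convex bowl
# whose minimum is at [3, -1].
grad = lambda phi: 2 * (phi - np.array([3.0, -1.0]))
phi_hat = gradient_descent(grad, [0.0, 0.0])
```

For this convex example any starting point converges; as the text goes on to explain, non-convex losses offer no such guarantee.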
6.1.1 Linear regression example

Consider applying gradient descent to the 1D linear regression model from chapter 2. The
model f[x, ϕ] maps a scalar input x to a scalar output y and has parameters ϕ = [ϕ₀, ϕ₁]ᵀ,
which represent the y-intercept and the slope:

y = f[x, ϕ]
  = ϕ₀ + ϕ₁x.                                                     (6.4)

Given a dataset {x_i, y_i} containing I input/output pairs, we choose the least squares
loss function:

L[ϕ] = ∑_{i=1}^{I} ℓ_i = ∑_{i=1}^{I} (f[x_i, ϕ] − y_i)²
     = ∑_{i=1}^{I} (ϕ₀ + ϕ₁x_i − y_i)²,                           (6.5)
Figure 6.1 Gradient descent for the linear regression model. a) Training set of
I = 12 input/output pairs {x_i, y_i}. b) Loss function showing iterations of gradient
descent. We start at point 0 and move in the steepest downhill direction until
we can improve no further to arrive at point 1. We then repeat this procedure.
We measure the gradient at point 1 and move downhill to point 2 and so on. c)
This can be visualized better as a heatmap, where the brightness represents the
loss. After only four iterations, we are already close to the minimum. d) The
model with the parameters at point 0 (lightest line) describes the data very badly,
but each successive iteration improves the fit. The model with the parameters at
point 4 (darkest line) is already a reasonable description of the training data.
where the term ℓ_i = (ϕ₀ + ϕ₁x_i − y_i)² is the individual contribution to the loss from
the i-th training example.

The derivative of the loss function with respect to the parameters can be decomposed
into the sum of the derivatives of the individual contributions:

∂L/∂ϕ = ∂/∂ϕ ∑_{i=1}^{I} ℓ_i = ∑_{i=1}^{I} ∂ℓ_i/∂ϕ,              (6.6)

where these are given by:

Problem 6.1

∂ℓ_i/∂ϕ = [ ∂ℓ_i/∂ϕ₀, ∂ℓ_i/∂ϕ₁ ]ᵀ
        = [ 2(ϕ₀ + ϕ₁x_i − y_i), 2x_i(ϕ₀ + ϕ₁x_i − y_i) ]ᵀ.       (6.7)

Figure 6.1 shows the progression of this algorithm as we iteratively compute the
derivatives according to equations 6.6 and 6.7 and then update the parameters using the
rule in equation 6.3. In this case, we have used a line search procedure to find the value
of α that decreases the loss the most at each iteration.

Notebook 6.2
Gradient descent
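A hedged NumPy sketch of this procedure, using the derivatives of equation 6.7 but a fixed learning rate rather than the line search used in the figure:

```python
import numpy as np

def fit_linear_regression(x, y, alpha=0.01, n_iters=5000):
    # Gradient descent on the least squares loss of equation 6.5,
    # using the derivatives in equations 6.6 and 6.7.
    phi = np.zeros(2)                       # [intercept phi_0, slope phi_1]
    for _ in range(n_iters):
        residual = phi[0] + phi[1] * x - y  # (phi_0 + phi_1 x_i - y_i)
        grad = np.array([2 * np.sum(residual),
                         2 * np.sum(x * residual)])
        phi = phi - alpha * grad            # update rule of equation 6.3
    return phi

# Noise-free data generated with intercept 1.0 and slope 2.0; gradient
# descent should recover these parameters.
x = np.linspace(0, 1, 12)
y = 1.0 + 2.0 * x
phi_hat = fit_linear_regression(x, y)
```

With a fixed learning rate, convergence requires α to be small enough for the data scale; the line search in the figure sidesteps this tuning.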
6.1.2 Gabor model example

Loss functions for linear regression problems (figure 6.1c) always have a single well-
defined global minimum. More formally, they are convex, which means that no chord
(line segment between two points on the surface) intersects the function. Convexity
implies that wherever we initialize the parameters, we are bound to reach the minimum
if we keep walking downhill; the training procedure can't fail.

Problem 6.2

Unfortunately, loss functions for most nonlinear models, including both shallow and
deep networks, are non-convex. Visualizing neural network loss functions is challenging
due to the number of parameters. Hence, we first explore a simpler nonlinear model with
two parameters to gain insight into the properties of non-convex loss functions:

f[x, ϕ] = sin[ϕ₀ + 0.06 · ϕ₁x] · exp[ −(ϕ₀ + 0.06 · ϕ₁x)² / 32.0 ].   (6.8)

This Gabor model maps scalar input x to scalar output y and consists of a sinusoidal
component (creating an oscillatory function) multiplied by a negative exponential com-
ponent (causing the amplitude to decrease as we move from the center). It has two
parameters ϕ = [ϕ₀, ϕ₁]ᵀ, where ϕ₀ ∈ ℝ determines the mean position of the function
and ϕ₁ ∈ ℝ⁺ stretches or squeezes it along the x-axis (figure 6.2).

Problems 6.3–6.5
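Equation 6.8 translates directly into code (our own sketch), together with the least squares loss used in the following experiments:

```python
import numpy as np

def gabor(x, phi):
    # Gabor model of equation 6.8: a sinusoid whose amplitude decays with
    # distance from the center, controlled by phi = [phi_0, phi_1].
    z = phi[0] + 0.06 * phi[1] * x
    return np.sin(z) * np.exp(-z ** 2 / 32.0)

def gabor_loss(phi, x, y):
    # Least squares loss (equation 6.9) for the Gabor model.
    return np.sum((gabor(x, phi) - y) ** 2)
```

Evaluating `gabor_loss` over a grid of (ϕ₀, ϕ₁) values reproduces the kind of non-convex loss surface shown in figure 6.4.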
Consider a training set of I examples {x_i, y_i} (figure 6.3). The least squares loss
function for I training examples is defined as:

L[ϕ] = ∑_{i=1}^{I} (f[x_i, ϕ] − y_i)².                            (6.9)

Once more, the goal is to find the parameters ϕ̂ that minimize this loss.
Figure 6.2 Gabor model. This nonlinear model maps scalar input x to scalar
output y and has parameters ϕ = [ϕ₀, ϕ₁]ᵀ. It describes a sinusoidal function
that decreases in amplitude with distance from its center. Parameter ϕ₀ ∈ ℝ
determines the position of the center. As ϕ₀ increases, the function moves left.
Parameter ϕ₁ ∈ ℝ⁺ squeezes the function along the x-axis relative to the center.
As ϕ₁ increases, the function narrows. a–c) Model with different parameters.
Figure 6.3 Training data for fitting the Gabor model. The training dataset
contains 28 input/output examples {x_i, y_i}. These data were created by uniformly
sampling x_i ∈ [−15, 15], passing the samples through a Gabor model with
parameters ϕ = [0.0, 16.6]ᵀ, and adding normally distributed noise.
6.1.3 Local minima and saddle points

Figure 6.4 depicts the loss function associated with the Gabor model for this dataset.
There are numerous local minima (cyan circles). Here the gradient is zero, and the loss
increases if we move in any direction, but we are not at the overall minimum of the
function. The point with the lowest loss is known as the global minimum and is depicted
by the gray circle.

Problem 6.6

If we start in a random position and use gradient descent to go downhill, there is
no guarantee that we will wind up at the global minimum and find the best parameters
(figure 6.5a). It's equally or even more likely that the algorithm will terminate in one
of the local minima. Furthermore, there is no way of knowing whether there is a better
solution elsewhere.

Problems 6.7–6.8
Figure 6.4 Loss function for the Gabor model. a) The loss function is non-convex,
with multiple local minima (cyan circles) in addition to the global minimum (gray
circle). It also contains saddle points where the gradient is locally zero, but the
function increases in one direction and decreases in the other. The blue cross is
an example of a saddle point; the function decreases as we move horizontally in
either direction but increases as we move vertically. b–f) Models associated with
the different minima. In each case, there is no small change that decreases the
loss. Panel (c) shows the global minimum, which has a loss of 0.64.
Figure 6.5 Gradient descent vs. stochastic gradient descent. a) Gradient descent
with line search. As long as the gradient descent algorithm is initialized in the
right “valley” of the loss function (e.g., points 1 and 3), the parameter estimate
will move steadily toward the global minimum. However, if it is initialized outside
this valley (e.g., point 2), it will descend toward one of the local minima. b)
Stochastic gradient descent adds noise to the optimization process, so it is possible
to escape from the wrong valley (e.g., point 2) and still reach the global minimum.
In addition, the loss function contains saddle points (e.g., the blue cross in figure 6.4). Here, the gradient is zero, but the function increases in some directions and decreases in others. If the current parameters are not exactly at the saddle point, then gradient descent can escape by moving downhill. However, the surface near the saddle point is flat, so it's hard to be sure that training hasn't converged; if we terminate the algorithm when the gradient is small, we may erroneously stop near a saddle point.
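To make this concrete, the following sketch (illustrative code, not from the book's notebooks) runs gradient descent on the toy loss L[ϕ] = ϕ₀² − ϕ₁², which has a saddle point at the origin. Terminating when the gradient is small stops near the saddle, even though moving along ϕ₁ would still decrease the loss:

```python
import numpy as np

# Toy loss with a saddle point at the origin: L[phi] = phi0^2 - phi1^2.
def loss(phi):
    return phi[0] ** 2 - phi[1] ** 2

def grad(phi):
    return np.array([2.0 * phi[0], -2.0 * phi[1]])

phi = np.array([1.0, 1e-8])   # start almost exactly on the saddle axis
alpha = 0.1
for _ in range(1000):
    if np.linalg.norm(grad(phi)) < 1e-3:   # looks "converged"...
        break
    phi = phi - alpha * grad(phi)

# ...but we have stopped near the saddle point (0, 0), not a minimum:
# moving in the phi1 direction would still reduce the loss.
```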
6.2 Stochastic gradient descent
The Gabor model has two parameters, so we could find the global minimum by either (i) exhaustively searching the parameter space or (ii) repeatedly starting gradient descent from different positions and choosing the result with the lowest loss. However, neural network models can have millions of parameters, so neither approach is practical. In short, using gradient descent to find the global optimum of a high-dimensional loss function is challenging. We can find a minimum, but there is no way to tell whether this is the global minimum or even a good one.

Figure 6.6 Alternative view of SGD for the Gabor model with a batch size of three. a) Loss function for the entire training dataset. At each iteration, there is a probability distribution of possible parameter changes (inset shows samples). These correspond to different choices of the three batch elements. b) Loss function for one possible batch. The SGD algorithm moves in the downhill direction on this function for a distance determined by the learning rate and the local gradient magnitude. The current model (dashed function in inset) changes to better fit the batch data (solid function). c) A different batch creates a different loss function and results in a different update. d) For this batch, the algorithm moves downhill with respect to the batch loss function but uphill with respect to the global loss function in panel (a). This is how SGD can escape local minima.
One of the main problems is that the final destination of a gradient descent algorithm is entirely determined by the starting point. Stochastic gradient descent (SGD) attempts to remedy this problem by adding some noise to the gradient at each step. The solution still moves downhill on average, but at any given iteration, the direction chosen is not necessarily the steepest downhill direction. Indeed, it might not be downhill at all. The SGD algorithm has the possibility of moving temporarily uphill and hence jumping from one "valley" of the loss function to another (figure 6.5b).
6.2.1 Batches and epochs
The mechanism for introducing randomness is simple. At each iteration, the algorithm chooses a random subset of the training data and computes the gradient from these examples alone. This subset is known as a minibatch or batch for short. The update rule for the model parameters ϕ_t at iteration t is hence:

\[
\boldsymbol{\phi}_{t+1} \leftarrow \boldsymbol{\phi}_{t} - \alpha \cdot \sum_{i \in \mathcal{B}_t} \frac{\partial \ell_i[\boldsymbol{\phi}_t]}{\partial \boldsymbol{\phi}}, \tag{6.10}
\]

where B_t is a set containing the indices of the input/output pairs in the current batch and, as before, ℓ_i is the loss due to the i-th pair. The term α is the learning rate and, together with the gradient magnitude, determines the distance moved at each iteration.
The learning rate is chosen at the start of the procedure and does not depend on the
local properties of the function.
The batches are usually drawn from the dataset without replacement. The algorithm works through the training examples until it has used all the data, at which point it starts sampling from the full training dataset again. A single pass through the entire training dataset is referred to as an epoch. A batch may be as small as a single example or as large as the whole dataset. The latter case is called full-batch gradient descent and is identical to regular (non-stochastic) gradient descent.
An alternative interpretation of SGD is that it computes the gradient of a different loss function at each iteration; the loss function depends on both the model and the training data and hence will differ for each randomly selected batch. In this view, SGD performs deterministic gradient descent on a constantly changing loss function (figure 6.6). However, despite this variability, the expected loss and expected gradients at any point remain the same as for gradient descent.
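As an illustration (a minimal sketch, not the book's notebook code), the update of equation 6.10 can be implemented directly for a two-parameter linear regression model; the batch indices are drawn without replacement, and one sweep through the permutation is one epoch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1D regression data: y = 2x - 1 plus a little noise.
x = rng.uniform(-1.0, 1.0, size=100)
y = 2.0 * x - 1.0 + 0.01 * rng.normal(size=100)

phi = np.zeros(2)               # [intercept, slope]
alpha, batch_size = 0.01, 20

for epoch in range(200):
    perm = rng.permutation(len(x))              # draw without replacement
    for start in range(0, len(x), batch_size):
        batch = perm[start:start + batch_size]  # indices B_t of this batch
        residual = phi[0] + phi[1] * x[batch] - y[batch]
        # Gradient of the summed least-squares loss over the batch.
        grad = np.array([2.0 * residual.sum(),
                         2.0 * (residual * x[batch]).sum()])
        phi = phi - alpha * grad
```

After training, phi should be close to the generating parameters (−1, 2).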
6.2.2 Properties of stochastic gradient descent
SGD has several attractive features. First, although it adds noise to the trajectory, it still improves the fit to a subset of the data at each iteration. Hence, the updates tend to be sensible even if they are not optimal. Second, because it draws training examples without replacement and iterates through the dataset, the training examples all still contribute equally. Third, it is less computationally expensive to compute the gradient from just a subset of the training data. Fourth, it can (in principle) escape local minima. Fifth, it reduces the chances of getting stuck near saddle points; it is likely that at least some of the possible batches will have a significant gradient at any point on the loss function. Finally, there is some evidence that SGD finds parameters for neural networks that cause them to generalize well to new data in practice (see section 9.2).
SGD does not necessarily "converge" in the traditional sense. However, the hope is that when we are close to the global minimum, all the data points will be well described by the model. Consequently, the gradient will be small, whichever batch is chosen, and the parameters will cease to change much. In practice, SGD is often applied with a learning rate schedule. The learning rate α starts at a high value and is decreased by a constant factor every N epochs. The logic is that in the early stages of training, we want the algorithm to explore the parameter space, jumping from valley to valley to find a sensible region. In later stages, we are roughly in the right place and are more concerned with fine-tuning the parameters, so we decrease α to make smaller changes.
6.3 Momentum
A common modication to stochastic gradient descent is to add a momentum term. We
update the parameters with a weighted combination of the gradient computed from the
current batch and the direction moved in the previous step:
m
t+1
β · m
t
+ (1 β)
X
i∈B
t
i
[ϕ
t
]
ϕ
ϕ
t+1
ϕ
t
α ·m
t+1
, (6.11)
where m
t
is the momentum (which drives the update at iteration t), β [0, 1) controls
the degree to which the gradient is smoothed over time, and α is the learning rate.
The recursive formulation of the momentum calculation means that the gradient step
is an innite weighted sum of all the previous gradients, where the weights get smaller
as we move back in time. The eective learning rate increases if all these gradients
Problem 6.10
are aligned over multiple iterations but decreases if the gradient direction repeatedly
changes as the terms in the sum cancel out. The overall eect is a smoother trajectory
and reduced oscillatory behavior in valleys (gure 6.7).
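A sketch of the update in equation 6.11 (function and variable names are our own). With gradients that flip sign at every step, as when oscillating across a valley, successive contributions largely cancel inside m, so the effective step stays small:

```python
import numpy as np

def momentum_step(phi, m, batch_grad, alpha=0.05, beta=0.9):
    """One SGD-with-momentum update (equation 6.11). `batch_grad` is the
    gradient summed over the current batch."""
    m = beta * m + (1 - beta) * batch_grad
    phi = phi - alpha * m
    return phi, m

# Gradients that alternate in sign mostly cancel inside the momentum m.
phi, m = np.zeros(1), np.zeros(1)
for i in range(100):
    g = np.array([1.0 if i % 2 == 0 else -1.0])
    phi, m = momentum_step(phi, m, g)
```

Conversely, when all gradients point the same way, m grows toward the common gradient and the effective learning rate increases.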
6.3.1 Nesterov accelerated momentum
The momentum term can be considered a coarse prediction of where the SGD algorithm will move next. Nesterov accelerated momentum (figure 6.8) computes the gradients at this predicted point rather than at the current point:
Figure 6.7 Stochastic gradient descent with momentum. a) Regular stochastic gradient descent takes a very indirect path toward the minimum. b) With a momentum term, the change at the current step is a weighted combination of the previous change and the gradient computed from the batch. This smooths out the trajectory and increases the speed of convergence.
Figure 6.8 Nesterov accelerated momentum. The solution has traveled along the dashed line to arrive at point 1. A traditional momentum update measures the gradient at point 1, moves some distance in this direction to point 2, and then adds the momentum term from the previous iteration (i.e., in the same direction as the dashed line), arriving at point 3. The Nesterov momentum update first applies the momentum term (moving from point 1 to point 4) and then measures the gradient and applies an update to arrive at point 5.
\[
\begin{aligned}
\mathbf{m}_{t+1} &\leftarrow \beta \cdot \mathbf{m}_{t} + (1-\beta) \sum_{i \in \mathcal{B}_t} \frac{\partial \ell_i[\boldsymbol{\phi}_t - \alpha\beta \cdot \mathbf{m}_t]}{\partial \boldsymbol{\phi}} \\
\boldsymbol{\phi}_{t+1} &\leftarrow \boldsymbol{\phi}_{t} - \alpha \cdot \mathbf{m}_{t+1},
\end{aligned} \tag{6.12}
\]

where now the gradients are evaluated at ϕ_t − αβ·m_t. One way to think about this is that the gradient term now corrects the path provided by momentum alone.
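A sketch of equation 6.12, with the gradient supplied as a function so it can be evaluated at the look-ahead point (the names and hyperparameters are illustrative):

```python
import numpy as np

def nesterov_step(phi, m, grad_fn, alpha=0.05, beta=0.9):
    """One Nesterov momentum update (equation 6.12): the gradient is
    evaluated at the predicted point phi - alpha*beta*m, not at phi."""
    lookahead = phi - alpha * beta * m
    m = beta * m + (1 - beta) * grad_fn(lookahead)
    phi = phi - alpha * m
    return phi, m

# On the quadratic loss L = phi^2 (gradient 2*phi), the iterates
# approach the minimum at zero.
phi, m = np.array([1.0]), np.zeros(1)
for _ in range(200):
    phi, m = nesterov_step(phi, m, lambda p: 2.0 * p)
```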
6.4 Adam
Gradient descent with a xed step size has the following undesirable property: it makes
large adjustments to parameters associated with large gradients (where perhaps we
should be more cautious) and small adjustments to parameters associated with small
gradients (where perhaps we should explore further). When the gradient of the loss
surface is much steeper in one direction than another, it is dicult to choose a learning
rate that (i) makes good progress in both directions and (ii) is stable (gures 6.9a–b).
A straightforward approach is to normalize the gradients so that we move a fixed distance (governed by the learning rate) in each direction. To do this, we first measure the gradient m_{t+1} and the pointwise squared gradient v_{t+1}:

\[
\begin{aligned}
\mathbf{m}_{t+1} &\leftarrow \frac{\partial L[\boldsymbol{\phi}_t]}{\partial \boldsymbol{\phi}} \\
\mathbf{v}_{t+1} &\leftarrow \left(\frac{\partial L[\boldsymbol{\phi}_t]}{\partial \boldsymbol{\phi}}\right)^2.
\end{aligned} \tag{6.13}
\]

Then we apply the update rule:

\[
\boldsymbol{\phi}_{t+1} \leftarrow \boldsymbol{\phi}_{t} - \alpha \cdot \frac{\mathbf{m}_{t+1}}{\sqrt{\mathbf{v}_{t+1}} + \epsilon}, \tag{6.14}
\]

where the square root and division are both pointwise, α is the learning rate, and ϵ is a small constant that prevents division by zero when the gradient magnitude is zero. The term v_{t+1} is the squared gradient, and the positive root of this is used to normalize the gradient itself, so all that remains is the sign in each coordinate direction. The result is that the algorithm moves a fixed distance α along each coordinate, where the direction is determined by whichever way is downhill (figure 6.9c). This simple algorithm makes good progress in both directions but will not converge unless it happens to land exactly at the minimum. Instead, it will bounce back and forth around the minimum.
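The normalized update of equation 6.14 in code (a sketch; dividing the gradient by the square root of its elementwise square leaves only its sign):

```python
import numpy as np

def normalized_step(phi, grad, alpha=0.1, eps=1e-8):
    """Equation 6.14 without momentum: move a fixed distance alpha along
    each coordinate, in whichever direction is downhill."""
    return phi - alpha * grad / (np.sqrt(grad ** 2) + eps)

# Steep and shallow directions now receive the same step size:
step = normalized_step(np.zeros(2), np.array([100.0, 0.01]))
```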
Adaptive moment estimation, or Adam, takes this idea and adds momentum to both the estimate of the gradient and the squared gradient:

\[
\begin{aligned}
\mathbf{m}_{t+1} &\leftarrow \beta \cdot \mathbf{m}_{t} + (1-\beta) \frac{\partial L[\boldsymbol{\phi}_t]}{\partial \boldsymbol{\phi}} \\
\mathbf{v}_{t+1} &\leftarrow \gamma \cdot \mathbf{v}_{t} + (1-\gamma) \left(\frac{\partial L[\boldsymbol{\phi}_t]}{\partial \boldsymbol{\phi}}\right)^2,
\end{aligned} \tag{6.15}
\]

where β and γ are the momentum coefficients for the two statistics.

Figure 6.9 Adaptive moment estimation (Adam). a) This loss function changes quickly in the vertical direction but slowly in the horizontal direction. If we run full-batch gradient descent with a learning rate that makes good progress in the vertical direction, then the algorithm takes a long time to reach the final horizontal position. b) If the learning rate is chosen so that the algorithm makes good progress in the horizontal direction, it overshoots in the vertical direction and becomes unstable. c) A straightforward approach is to move a fixed distance along each axis at each step so that we move downhill in both directions. This is accomplished by normalizing the gradient magnitude and retaining only the sign. However, this does not usually converge to the exact minimum but instead oscillates back and forth around it (here between the last two points). d) The Adam algorithm uses momentum in both the estimated gradient and the normalization term, which creates a smoother path.
Using momentum is equivalent to taking a weighted average over the history of each of these statistics. At the start of the procedure, all the previous measurements are effectively zero, resulting in unrealistically small estimates. Consequently, we modify these statistics using the rule:

\[
\begin{aligned}
\tilde{\mathbf{m}}_{t+1} &\leftarrow \frac{\mathbf{m}_{t+1}}{1-\beta^{t+1}} \\
\tilde{\mathbf{v}}_{t+1} &\leftarrow \frac{\mathbf{v}_{t+1}}{1-\gamma^{t+1}}.
\end{aligned} \tag{6.16}
\]

Since β and γ are in the range [0, 1), the terms with exponents t+1 become smaller with each time step, the denominators become closer to one, and this modification has a diminishing effect.
Finally, we update the parameters as before, but with the modified terms:

\[
\boldsymbol{\phi}_{t+1} \leftarrow \boldsymbol{\phi}_{t} - \alpha \cdot \frac{\tilde{\mathbf{m}}_{t+1}}{\sqrt{\tilde{\mathbf{v}}_{t+1}} + \epsilon}. \tag{6.17}
\]

The result is an algorithm that can converge to the overall minimum and makes good progress in every direction in the parameter space. Note that Adam is usually used in a stochastic setting where the gradients and their squares are computed from mini-batches:

\[
\begin{aligned}
\mathbf{m}_{t+1} &\leftarrow \beta \cdot \mathbf{m}_{t} + (1-\beta) \sum_{i \in \mathcal{B}_t} \frac{\partial \ell_i[\boldsymbol{\phi}_t]}{\partial \boldsymbol{\phi}} \\
\mathbf{v}_{t+1} &\leftarrow \gamma \cdot \mathbf{v}_{t} + (1-\gamma) \left(\sum_{i \in \mathcal{B}_t} \frac{\partial \ell_i[\boldsymbol{\phi}_t]}{\partial \boldsymbol{\phi}}\right)^2,
\end{aligned} \tag{6.18}
\]

and so the trajectory is noisy in practice.
As we shall see in chapter 7, the gradient magnitudes of neural network parameters can depend on their depth in the network. Adam helps compensate for this tendency and balances out changes across the different layers. In practice, Adam also has the advantage of being less sensitive to the initial learning rate because it avoids situations like those in figures 6.9a–b, so it doesn't need complex learning rate schedules.
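Putting equations 6.15–6.17 together gives a compact full-batch implementation (an illustrative sketch; the hyperparameter values are assumptions, not prescriptions):

```python
import numpy as np

def adam(grad_fn, phi, alpha=0.01, beta=0.9, gamma=0.999,
         eps=1e-8, steps=2000):
    """Adam (equations 6.15-6.17) with a deterministic gradient."""
    m = np.zeros_like(phi)
    v = np.zeros_like(phi)
    for t in range(steps):
        g = grad_fn(phi)
        m = beta * m + (1 - beta) * g                  # equation 6.15
        v = gamma * v + (1 - gamma) * g ** 2
        m_tilde = m / (1 - beta ** (t + 1))            # equation 6.16
        v_tilde = v / (1 - gamma ** (t + 1))
        phi = phi - alpha * m_tilde / (np.sqrt(v_tilde) + eps)  # eq. 6.17
    return phi

# A loss like figure 6.9: much steeper in one direction than the other.
# Adam makes comparable progress along both axes.
phi = adam(lambda p: np.array([0.02 * p[0], 2.0 * p[1]]),
           np.array([3.0, 3.0]))
```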
6.5 Training algorithm hyperparameters
The choices of learning algorithm, batch size, learning rate schedule, and momentum coefficients are all considered hyperparameters of the training algorithm; these directly affect the final model performance but are distinct from the model parameters. Choosing these can be more art than science, and it's common to train many models with different hyperparameters and choose the best one. This is known as hyperparameter search. We return to this issue in chapter 8.
6.6 Summary
This chapter discussed model training. This problem was framed as finding parameters ϕ that corresponded to the minimum of a loss function L[ϕ]. The gradient descent method measures the gradient of the loss function for the current parameters (i.e., how the loss changes when we make a small change to the parameters). Then it moves the parameters in the direction that decreases the loss fastest. This is repeated until convergence.
For nonlinear functions, the loss function may have both local minima (where gradient descent gets trapped) and saddle points (where gradient descent may appear to have converged but has not). Stochastic gradient descent helps mitigate these problems.¹ At each iteration, we use a different random subset of the data (a batch) to compute the gradient. This adds noise to the process and helps prevent the algorithm from getting trapped in a sub-optimal region of parameter space. Each iteration is also computationally cheaper since it only uses a subset of the data. We saw that adding a momentum term makes convergence more efficient. Finally, we introduced the Adam algorithm.
The ideas in this chapter apply to optimizing any model. The next chapter tackles two aspects of training specific to neural networks. First, we address how to compute the gradients of the loss with respect to the parameters of a neural network. This is accomplished using the famous backpropagation algorithm. Second, we discuss how to initialize the network parameters before optimization begins. Without careful initialization, the gradients used by the optimization can become extremely large or extremely small, which can hinder the training process.
Notes
Optimization algorithms: Optimization algorithms are used extensively throughout engineering, and it is generally more typical to use the term objective function rather than loss function or cost function. Gradient descent was invented by Cauchy (1847), and stochastic gradient descent dates back to at least Robbins & Monro (1951). A modern compromise between the two is stochastic variance-reduced descent (Johnson & Zhang, 2013), in which the full gradient is computed periodically, with stochastic updates interspersed. Reviews of optimization algorithms for neural networks can be found in Ruder (2016), Bottou et al. (2018), and Sun (2020). Bottou (2012) discusses best practice for SGD, including shuffling without replacement.
¹ Chapter 20 discusses the extent to which saddle points and local minima really are problems in deep learning. In practice, deep networks are surprisingly easy to train.
Convexity, minima, and saddle points: A function is convex if no chord (line segment between two points on the surface) intersects the function. This can be tested algebraically by considering the Hessian matrix (the matrix of second derivatives):

\[
\mathbf{H}[\boldsymbol{\phi}] =
\begin{bmatrix}
\frac{\partial^2 L}{\partial \phi_0^2} & \frac{\partial^2 L}{\partial \phi_0 \partial \phi_1} & \cdots & \frac{\partial^2 L}{\partial \phi_0 \partial \phi_N} \\
\frac{\partial^2 L}{\partial \phi_1 \partial \phi_0} & \frac{\partial^2 L}{\partial \phi_1^2} & \cdots & \frac{\partial^2 L}{\partial \phi_1 \partial \phi_N} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 L}{\partial \phi_N \partial \phi_0} & \frac{\partial^2 L}{\partial \phi_N \partial \phi_1} & \cdots & \frac{\partial^2 L}{\partial \phi_N^2}
\end{bmatrix}. \tag{6.19}
\]

If the Hessian matrix is positive definite (has positive eigenvalues) for all possible parameter values, then the function is convex; the loss function will look like a smooth bowl (as in figure 6.1c), so training will be relatively easy. There will be a single global minimum and no local minima or saddle points.
For any loss function, the eigenvalues of the Hessian matrix at places where the gradient is zero allow us to classify this position as (i) a minimum (the eigenvalues are all positive), (ii) a maximum (the eigenvalues are all negative), or (iii) a saddle point (positive eigenvalues are associated with directions in which we are at a minimum and negative ones with directions where we are at a maximum).
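This classification is easy to check numerically; for instance (an illustrative snippet), using the Hessians of L = ϕ₀² + ϕ₁², L = −ϕ₀² − ϕ₁², and L = ϕ₀² − ϕ₁² at the origin:

```python
import numpy as np

def classify_critical_point(hessian):
    """Classify a point where the gradient is zero from the eigenvalues
    of the (symmetric) Hessian matrix."""
    evals = np.linalg.eigvalsh(hessian)
    if np.all(evals > 0):
        return "minimum"
    if np.all(evals < 0):
        return "maximum"
    return "saddle point"

# Hessians of the three example losses at the origin:
kinds = [classify_critical_point(np.diag(d))
         for d in ([2.0, 2.0], [-2.0, -2.0], [2.0, -2.0])]
```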
Line search: Gradient descent with a fixed step size is inefficient because the distance moved depends entirely on the magnitude of the gradient. It moves a long distance when the function is changing fast (where perhaps it should be more cautious) but a short distance when the function is changing slowly (where perhaps it should explore further). For this reason, gradient descent methods are usually combined with a line search procedure in which we sample the function along the desired direction to try to find the optimal step size. One such approach is bracketing (figure 6.10). Another problem with gradient descent is that it tends to lead to inefficient oscillatory behavior when descending valleys (e.g., path 1 in figure 6.5a).
Beyond gradient descent: Numerous algorithms have been developed that remedy the problems of gradient descent. Most notable is the Newton method, which takes the curvature of the surface into account using the inverse of the Hessian matrix; if the gradient of the function is changing quickly, then it applies a more cautious update. This method eliminates the need for line search and does not suffer from oscillatory behavior. However, it has its own problems; in its simplest form, it moves toward the nearest extremum, but this may be a maximum if we are closer to the top of a hill than we are to the bottom of a valley. Moreover, computing the inverse Hessian is intractable when the number of parameters is large, as in neural networks.
Properties of SGD: The limit of SGD as the learning rate tends to zero is a stochastic differential equation. Jastrzębski et al. (2018) showed that this equation relies on the learning-rate to batch-size ratio and that there is a relation between this ratio and the width of the minimum found. Wider minima are considered more desirable; if the loss function for test data is similar, then small errors in the parameter estimates will have little effect on test performance. He et al. (2019) prove a generalization bound for SGD that has a positive correlation with the ratio of batch size to learning rate. They train a large number of models on different architectures and datasets and find empirical evidence that test accuracy improves when the ratio of batch size to learning rate is low. Smith et al. (2018) and Goyal et al. (2018) also identified the ratio of batch size to learning rate as being important for generalization (see figure 20.10).
Momentum: The idea of using momentum to speed up optimization dates to Polyak (1964). Goh (2017) presents an in-depth discussion of the properties of momentum. The Nesterov accelerated gradient method was introduced by Nesterov (1983). Nesterov momentum was first applied in the context of stochastic gradient descent by Sutskever et al. (2013).

Figure 6.10 Line search using the bracketing approach. a) The current solution is at position a (orange point), and we wish to search the region [a, d] (gray shaded area). We define two points b, c interior to the search region and evaluate the loss function at these points. Here L[b] > L[c], so we eliminate the range [a, b]. b) We now repeat this procedure in the refined search region and find that L[b] < L[c], so we eliminate the range [c, d]. c) We repeat this process until the minimum is closely bracketed.
Adaptive training algorithms: AdaGrad (Duchi et al., 2011) is an optimization algorithm that addresses the possibility that some parameters may have to move further than others by assigning a different learning rate to each parameter. AdaGrad uses the cumulative squared gradient for each parameter to attenuate its learning rate. This has the disadvantage that the learning rates decrease over time, and learning can halt before the minimum is found. RMSProp (Hinton et al., 2012a) and AdaDelta (Zeiler, 2012) modified this algorithm to help prevent these problems by recursively updating the squared gradient term.
By far the most widely used adaptive training algorithm is adaptive moment optimization or Adam (Kingma & Ba, 2015). This combines the ideas of momentum (in which the gradient vector is averaged over time) and AdaGrad, AdaDelta, and RMSProp (in which a smoothed squared gradient term is used to modify the learning rate for each parameter). The original paper on the Adam algorithm provided a convergence proof for convex loss functions, but a counterexample was identified by Reddi et al. (2018), who developed a modification of Adam called AMSGrad, which does converge. Of course, in deep learning, the loss functions are non-convex, and Zaheer et al. (2018) subsequently developed an adaptive algorithm called YOGI and proved that it converges in this scenario. Regardless of these theoretical objections, the original Adam algorithm works well in practice and is widely used, not least because it works well over a broad range of hyperparameters and makes rapid initial progress.
One potential problem with adaptive training algorithms is that the learning rates are based on accumulated statistics of the observed gradients. At the start of training, when there are few samples, these statistics may be very noisy. This can be remedied by learning rate warm-up (Goyal et al., 2018), in which the learning rates are gradually increased over the first few thousand iterations. An alternative solution is rectified Adam (Liu et al., 2021a), which gradually changes the momentum term over time in a way that helps avoid high variance. Dozat (2016) incorporated Nesterov momentum into the Adam algorithm.
SGD vs. Adam: There has been a lively discussion about the relative merits of SGD and Adam. Wilson et al. (2017) provided evidence that SGD with momentum can find lower minima than Adam, which generalizes better over a variety of deep learning tasks. However, this is strange since SGD is a special case of Adam (when β = γ = 0) once the modification term (equation 6.16) becomes one, which happens quickly. It is hence more likely that SGD outperforms Adam when we use Adam's default hyperparameters. Loshchilov & Hutter (2019) proposed AdamW, which substantially improves the performance of Adam in the presence of L2 regularization (see section 9.1). Choi et al. (2019) provide evidence that if we search for the best Adam hyperparameters, it performs just as well as SGD and converges faster. Keskar & Socher (2017) proposed a method called SWATS that starts using Adam (to make rapid initial progress) and then switches to SGD (to get better final generalization performance).
Exhaustive search: All the algorithms discussed in this chapter are iterative. A completely different approach is to quantize the network parameters and exhaustively search the resulting discretized parameter space using SAT solvers (Mézard & Mora, 2009). This approach has the potential to find the global minimum and provide a guarantee that there is no lower loss elsewhere but is only practical for very small models.
Problems
Problem 6.1 Show that the derivatives of the least squares loss function in equation 6.5 are
given by the expressions in equation 6.7.
Problem 6.2 A surface is convex if the eigenvalues of the Hessian H[ϕ] are positive everywhere. In this case, the surface has a unique minimum, and optimization is easy. Find an algebraic expression for the Hessian matrix,

\[
\mathbf{H}[\boldsymbol{\phi}] =
\begin{bmatrix}
\frac{\partial^2 L}{\partial \phi_0^2} & \frac{\partial^2 L}{\partial \phi_0 \partial \phi_1} \\
\frac{\partial^2 L}{\partial \phi_1 \partial \phi_0} & \frac{\partial^2 L}{\partial \phi_1^2}
\end{bmatrix}, \tag{6.20}
\]

for the linear regression model (equation 6.5). Prove that this function is convex by showing that the eigenvalues are always positive. This can be done by showing that both the trace and the determinant of the matrix are positive.
Problem 6.3 Compute the derivatives of the least squares loss L[ϕ] with respect to the parameters ϕ_0 and ϕ_1 for the Gabor model (equation 6.8).
Problem 6.4 The logistic regression model uses a linear function to assign an input x to one of two classes y ∈ {0, 1}. For a 1D input and a 1D output, it has two parameters, ϕ_0 and ϕ_1, and is defined by:

\[
Pr(y = 1|x) = \text{sig}[\phi_0 + \phi_1 x], \tag{6.21}
\]

where sig[•] is the logistic sigmoid function:

\[
\text{sig}[z] = \frac{1}{1 + \exp[-z]}. \tag{6.22}
\]
Figure 6.11 Three 1D loss functions for problem 6.6.
(i) Plot y against x for this model for different values of ϕ_0 and ϕ_1 and explain the qualitative meaning of each parameter. (ii) What is a suitable loss function for this model? (iii) Compute the derivatives of this loss function with respect to the parameters. (iv) Generate ten data points from a normal distribution with mean −1 and standard deviation 1 and assign them the label y = 0. Generate another ten data points from a normal distribution with mean 1 and standard deviation 1 and assign these the label y = 1. Plot the loss as a heatmap in terms of the two parameters ϕ_0 and ϕ_1. (v) Is this loss function convex? How could you prove this?
Problem 6.5 Compute the derivatives of the least squares loss with respect to the ten parameters of the simple neural network model introduced in equation 3.1:

\[
f[x, \boldsymbol{\phi}] = \phi_0 + \phi_1 a[\theta_{10} + \theta_{11} x] + \phi_2 a[\theta_{20} + \theta_{21} x] + \phi_3 a[\theta_{30} + \theta_{31} x]. \tag{6.23}
\]

Think carefully about what the derivative of the ReLU function a[•] will be.
Problem 6.6 Which of the functions in figure 6.11 is convex? Justify your answer. Characterize each of the points 1–7 as (i) a local minimum, (ii) the global minimum, or (iii) neither.
Problem 6.7 The gradient descent trajectory for path 1 in figure 6.5a oscillates back and forth inefficiently as it moves down the valley toward the minimum. It's also notable that it turns at right angles to the previous direction at each step. Provide a qualitative explanation for these phenomena. Propose a solution that might help prevent this behavior.
Problem 6.8 Can (non-stochastic) gradient descent with a fixed learning rate escape local minima?
Problem 6.9 We run the stochastic gradient descent algorithm for 1,000 iterations on a dataset
of size 100 with a batch size of 20. For how many epochs did we train the model?
Problem 6.10 Show that the momentum term m_t (equation 6.11) is an infinite weighted sum of the gradients at the previous iterations and derive an expression for the coefficients (weights) of that sum.
Problem 6.11 What dimensions will the Hessian have if the model has one million parameters?
Chapter 7
Gradients and initialization
Chapter 6 introduced iterative optimization algorithms. These are general-purpose methods for finding the minimum of a function. In the context of neural networks, they find parameters that minimize the loss so that the model accurately predicts the training outputs from the inputs. The basic approach is to choose initial parameters randomly and then make a series of small changes that decrease the loss on average. Each change is based on the gradient of the loss with respect to the parameters at the current position.
This chapter discusses two issues that are specific to neural networks. First, we consider how to calculate the gradients efficiently. This is a serious challenge since the largest models at the time of writing have 10¹² parameters, and the gradient needs to be computed for every parameter at every iteration of the training algorithm. Second, we consider how to initialize the parameters. If this is not done carefully, the initial losses and their gradients can be extremely large or small. In either case, this impedes the training process.
7.1 Problem denitions
Consider a network f[x, ϕ] with multivariate input x, parameters ϕ, and three hidden layers h_1, h_2, and h_3:

\[
\begin{aligned}
\mathbf{h}_1 &= a[\boldsymbol{\beta}_0 + \boldsymbol{\Omega}_0 \mathbf{x}] \\
\mathbf{h}_2 &= a[\boldsymbol{\beta}_1 + \boldsymbol{\Omega}_1 \mathbf{h}_1] \\
\mathbf{h}_3 &= a[\boldsymbol{\beta}_2 + \boldsymbol{\Omega}_2 \mathbf{h}_2] \\
f[\mathbf{x}, \boldsymbol{\phi}] &= \boldsymbol{\beta}_3 + \boldsymbol{\Omega}_3 \mathbf{h}_3,
\end{aligned} \tag{7.1}
\]

where the function a[•] applies the activation function separately to every element of the input. The model parameters ϕ = {β_0, Ω_0, β_1, Ω_1, β_2, Ω_2, β_3, Ω_3} consist of the bias vectors β_k and weight matrices Ω_k between every layer (figure 7.1).
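A sketch of this forward computation (the shapes and the choice of ReLU for a[•] are illustrative assumptions):

```python
import numpy as np

def forward(x, betas, omegas):
    """Forward pass for the network in equation 7.1: three hidden layers
    followed by a linear output layer."""
    h = x
    for beta, omega in zip(betas[:-1], omegas[:-1]):
        h = np.maximum(0.0, beta + omega @ h)   # a[.] taken to be ReLU
    return betas[-1] + omegas[-1] @ h           # final linear layer

# Tiny example: 4-D input, three hidden layers of width 3, scalar output.
rng = np.random.default_rng(0)
betas = [rng.normal(size=3) for _ in range(3)] + [rng.normal(size=1)]
omegas = ([rng.normal(size=(3, 4))]
          + [rng.normal(size=(3, 3)) for _ in range(2)]
          + [rng.normal(size=(1, 3))])
f = forward(rng.normal(size=4), betas, omegas)
```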
We also have individual loss terms ℓ_i, which return the negative log-likelihood of the ground truth label y_i given the model prediction f[x_i, ϕ] for training input x_i. For example, this might be the least squares loss ℓ_i = (f[x_i, ϕ] − y_i)². The total loss is the sum of these terms over the training data:

\[
L[\boldsymbol{\phi}] = \sum_{i=1}^{I} \ell_i. \tag{7.2}
\]
The most commonly used optimization algorithm for training neural networks is stochastic gradient descent (SGD), which updates the parameters as:

\[
\boldsymbol{\phi}_{t+1} \leftarrow \boldsymbol{\phi}_{t} - \alpha \sum_{i \in \mathcal{B}_t} \frac{\partial \ell_i[\boldsymbol{\phi}_t]}{\partial \boldsymbol{\phi}}, \tag{7.3}
\]

where α is the learning rate, and B_t contains the batch indices at iteration t. To compute this update, we need to calculate the derivatives:

\[
\frac{\partial \ell_i}{\partial \boldsymbol{\beta}_k} \quad \text{and} \quad \frac{\partial \ell_i}{\partial \boldsymbol{\Omega}_k}, \tag{7.4}
\]

for the parameters {β_k, Ω_k} at every layer k ∈ {0, 1, . . . , K} and for each index i in the batch. The first part of this chapter describes the backpropagation algorithm, which computes these derivatives efficiently.
In the second part of the chapter, we consider how to initialize the network parameters before we commence training. We describe methods to choose the initial weights Ω_k and biases β_k so that training is stable.
7.2 Computing derivatives
The derivatives of the loss tell us how the loss changes when we make a small change to the parameters. Optimization algorithms exploit this information to manipulate the parameters so that the loss becomes smaller. The backpropagation algorithm computes these derivatives. The mathematical details are somewhat involved, so we first make two observations that provide some intuition.
Observation 1: Each weight (element of Ω_k) multiplies the activation at a source hidden unit and adds the result to a destination hidden unit in the next layer. It follows that the effect of any small change to the weight is amplified or attenuated by the activation at the source hidden unit. Hence, we run the network for each data example in the batch and store the activations of all the hidden units. This is known as the forward pass (figure 7.1). The stored activations will subsequently be used to compute the gradients.
Observation 2: A small change in a bias or weight causes a ripple effect of changes through the subsequent network. The change modifies the value of its destination hidden
Draft: please send errata to udlbookmail@gmail.com.
Figure 7.1 Backpropagation forward pass. The goal is to compute the derivatives of the loss with respect to each of the weights (arrows) and biases (not shown). In other words, we want to know how a small change to each parameter will affect the loss. Each weight multiplies the hidden unit at its source and contributes the result to the hidden unit at its destination. Consequently, the effects of any small change to the weight will be scaled by the activation of the source hidden unit. For example, the blue weight is applied to the second hidden unit at layer 1; if the activation of this unit doubles, then the effect of a small change to the blue weight will double too. Hence, to compute the derivatives of the weights, we need to calculate and store the activations at the hidden layers. This is known as the forward pass since it involves running the network equations sequentially.
unit. This, in turn, changes the values of the hidden units in the subsequent layer, which will change the hidden units in the layer after that, and so on, until a change is made to the model output and, finally, the loss.
Hence, to know how changing a parameter modifies the loss, we also need to know how changes to every subsequent hidden layer will, in turn, modify their successor. These same quantities are required when considering other parameters in the same or earlier layers. It follows that we can calculate them once and reuse them. For example, consider computing the effect of a small change in weights that feed into hidden layers h_3, h_2, and h_1, respectively:
• To calculate how a small change in a weight or bias feeding into hidden layer h_3 modifies the loss, we need to know (i) how a change in layer h_3 changes the model output f, and (ii) how a change in this output changes the loss (figure 7.2a).

• To calculate how a small change in a weight or bias feeding into hidden layer h_2 modifies the loss, we need to know (i) how a change in layer h_2 affects h_3, (ii) how h_3 changes the model output, and (iii) how this output changes the loss (figure 7.2b).

• To calculate how a small change in a weight or bias feeding into hidden layer h_1 modifies the loss, we need to know (i) how a change in layer h_1 affects layer h_2, (ii) how a change in layer h_2 affects layer h_3, (iii) how layer h_3 changes the model output, and (iv) how the model output changes the loss (figure 7.2c).
Figure 7.2 Backpropagation backward pass. a) To compute how a change to a weight feeding into layer h_3 (blue arrow) changes the loss, we need to know how the hidden unit in h_3 changes the model output f and how f changes the loss (orange arrows). b) To compute how a small change to a weight feeding into h_2 (blue arrow) changes the loss, we need to know (i) how the hidden unit in h_2 changes h_3, (ii) how h_3 changes f, and (iii) how f changes the loss (orange arrows). c) Similarly, to compute how a small change to a weight feeding into h_1 (blue arrow) changes the loss, we need to know how h_1 changes h_2 and how these changes propagate through to the loss (orange arrows). The backward pass first computes derivatives at the end of the network and then works backward to exploit the inherent redundancy of these computations.
As we move backward through the network, we see that most of the terms we need were already calculated in the previous step, so we do not need to re-compute them. Proceeding backward through the network in this way to compute the derivatives is known as the backward pass.

The ideas behind backpropagation are relatively easy to understand. However, the derivation requires matrix calculus because the bias and weight terms are vectors and matrices, respectively. To help grasp the underlying mechanics, the following section derives backpropagation for a simpler toy model with scalar parameters. We then apply the same approach to a deep neural network in section 7.4.
7.3 Toy example
Consider a model f[x, ϕ] with eight scalar parameters ϕ = {β_0, ω_0, β_1, ω_1, β_2, ω_2, β_3, ω_3} that consists of a composition of the functions sin[•], exp[•], and cos[•]:

    f[x, ϕ] = β_3 + ω_3 · cos[β_2 + ω_2 · exp[β_1 + ω_1 · sin[β_0 + ω_0 · x]]],   (7.5)
and a least squares loss function L[ϕ] = Σ_i ℓ_i with individual terms:

    ℓ_i = (f[x_i, ϕ] − y_i)²,   (7.6)

where, as usual, x_i is the i-th training input, and y_i is the i-th training output. You can think of this as a simple neural network with one input, one output, one hidden unit at each layer, and different activation functions sin[•], exp[•], and cos[•] between each layer.
We aim to compute the derivatives:

    ∂ℓ_i/∂β_0, ∂ℓ_i/∂ω_0, ∂ℓ_i/∂β_1, ∂ℓ_i/∂ω_1, ∂ℓ_i/∂β_2, ∂ℓ_i/∂ω_2, ∂ℓ_i/∂β_3, and ∂ℓ_i/∂ω_3.
Of course, we could nd expressions for these derivatives by hand and compute them
directly. However, some of these expressions are quite complex. For example:
i
ω
0
= 2
β
3
+ ω
3
· cos
h
β
2
+ ω
2
· exp
β
1
+ ω
1
· sin[β
0
+ ω
0
· x
i
]
i
y
i
·ω
1
ω
2
ω
3
· x
i
· cos[β
0
+ ω
0
· x
i
] · exp
h
β
1
+ ω
1
· sin[β
0
+ ω
0
· x
i
]
i
·sin
β
2
+ ω
2
· exp
h
β
1
+ ω
1
· sin[β
0
+ ω
0
· x
i
]
i
. (7.7)
Such expressions are awkward to derive and code without mistakes and do not exploit
the inherent redundancy; notice that the three exponential terms are the same.
The backpropagation algorithm is an ecient method for computing all of these
derivatives at once. It consists of (i) a forward pass, in which we compute and store a
series of intermediate values and the network output, and (ii) a backward pass, in which
Figure 7.3 Backpropagation forward pass. We compute and store each of the intermediate variables in turn until we finally calculate the loss.
we calculate the derivatives of each parameter, starting at the end of the network, and
reusing previous calculations as we move toward the start.
Forward pass: We treat the computation of the loss as a series of calculations:

    f_0 = β_0 + ω_0 · x_i
    h_1 = sin[f_0]
    f_1 = β_1 + ω_1 · h_1
    h_2 = exp[f_1]
    f_2 = β_2 + ω_2 · h_2
    h_3 = cos[f_2]
    f_3 = β_3 + ω_3 · h_3
    ℓ_i = (f_3 − y_i)².   (7.8)

We compute and store the values of the intermediate variables f_k and h_k (figure 7.3).
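The forward pass of equation 7.8 is a direct transcription into code. This sketch is our own illustration with arbitrary made-up parameter values; the function name and structure are not taken from the accompanying notebooks.

```python
import math

def forward_pass(x, beta, omega):
    # equation 7.8: compute and store the intermediate quantities f_k and h_k
    f0 = beta[0] + omega[0] * x
    h1 = math.sin(f0)
    f1 = beta[1] + omega[1] * h1
    h2 = math.exp(f1)
    f2 = beta[2] + omega[2] * h2
    h3 = math.cos(f2)
    f3 = beta[3] + omega[3] * h3
    return f3, (f0, h1, f1, h2, f2, h3)   # output plus stored intermediates

beta = [0.1, 0.2, 0.3, 0.4]    # arbitrary illustrative parameter values
omega = [0.5, 0.6, 0.7, 0.8]
f3, stored = forward_pass(2.0, beta, omega)
loss = (f3 - 1.0) ** 2         # least squares loss against a target y_i = 1.0
```

The stored tuple is exactly what the backward pass will consume; nothing is recomputed later.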
Backward pass #1: We now compute the derivatives of ℓ_i with respect to these intermediate variables, but in reverse order:

    ∂ℓ_i/∂f_3, ∂ℓ_i/∂h_3, ∂ℓ_i/∂f_2, ∂ℓ_i/∂h_2, ∂ℓ_i/∂f_1, ∂ℓ_i/∂h_1, and ∂ℓ_i/∂f_0.   (7.9)
The rst of these derivatives is straightforward:
i
f
3
= 2(f
3
y
i
). (7.10)
The next derivative can be calculated using the chain rule:

    ∂ℓ_i/∂h_3 = (∂f_3/∂h_3) (∂ℓ_i/∂f_3).   (7.11)
The left-hand side asks how ℓ_i changes when h_3 changes. The right-hand side says we can decompose this into (i) how f_3 changes when h_3 changes and (ii) how ℓ_i changes when f_3 changes. In the original equations, h_3 changes f_3, which changes ℓ_i, and the derivatives
Figure 7.4 Backpropagation backward pass #1. We work backward from the end of the function computing the derivatives ∂ℓ_i/∂f_k and ∂ℓ_i/∂h_k of the loss with respect to the intermediate quantities. Each derivative is computed from the previous one by multiplying by terms of the form ∂f_k/∂h_k or ∂h_k/∂f_{k−1}.
represent the eects of this chain. Notice that we already computed the second of these
derivatives, and the other is the derivative of β
3
+ ω
3
·h
3
with respect to h
3
, which is ω
3
.
We continue in this way, computing the derivatives of the output with respect to these intermediate quantities (figure 7.4):
    ∂ℓ_i/∂f_2 = (∂h_3/∂f_2) [(∂f_3/∂h_3) (∂ℓ_i/∂f_3)]
    ∂ℓ_i/∂h_2 = (∂f_2/∂h_2) [(∂h_3/∂f_2) (∂f_3/∂h_3) (∂ℓ_i/∂f_3)]
    ∂ℓ_i/∂f_1 = (∂h_2/∂f_1) [(∂f_2/∂h_2) (∂h_3/∂f_2) (∂f_3/∂h_3) (∂ℓ_i/∂f_3)]
    ∂ℓ_i/∂h_1 = (∂f_1/∂h_1) [(∂h_2/∂f_1) (∂f_2/∂h_2) (∂h_3/∂f_2) (∂f_3/∂h_3) (∂ℓ_i/∂f_3)]
    ∂ℓ_i/∂f_0 = (∂h_1/∂f_0) [(∂f_1/∂h_1) (∂h_2/∂f_1) (∂f_2/∂h_2) (∂h_3/∂f_2) (∂f_3/∂h_3) (∂ℓ_i/∂f_3)].   (7.12)

In each case, we have already computed the quantities in the brackets in the previous step, and the last term has a simple expression (problem 7.2). These equations embody Observation 2 from the previous section (figure 7.2); we can reuse the previously computed derivatives if we calculate them in reverse order.
Backward pass #2: Finally, we consider how the loss ℓ_i changes when we change the parameters {β_k} and {ω_k}. Once more, we apply the chain rule (figure 7.5):

    ∂ℓ_i/∂β_k = (∂f_k/∂β_k) (∂ℓ_i/∂f_k)
    ∂ℓ_i/∂ω_k = (∂f_k/∂ω_k) (∂ℓ_i/∂f_k).   (7.13)

In each case, the second term on the right-hand side was computed in equation 7.12. When k > 0, we have f_k = β_k + ω_k · h_k, so:

    ∂f_k/∂β_k = 1 and ∂f_k/∂ω_k = h_k.   (7.14)
Figure 7.5 Backpropagation backward pass #2. Finally, we compute the derivatives ∂ℓ_i/∂β_k and ∂ℓ_i/∂ω_k. Each derivative is computed by multiplying the term ∂ℓ_i/∂f_k by ∂f_k/∂β_k or ∂f_k/∂ω_k as appropriate.
This is consistent with Observation 1 from the previous section; the effect of a change in the weight ω_k is proportional to the value of the source variable h_k (which was stored in the forward pass). The final derivatives from the term f_0 = β_0 + ω_0 · x_i are (notebook 7.1):

    ∂f_0/∂β_0 = 1 and ∂f_0/∂ω_0 = x_i.   (7.15)

Backpropagation is both simpler and more efficient than computing the derivatives individually, as in equation 7.7.¹
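The complete toy-model backward pass fits in one short function. This is our own illustrative sketch of equations 7.8–7.15 (the accompanying notebooks may organize it differently); the final lines check one derivative against a finite-difference approximation, a standard sanity check for hand-written backpropagation.

```python
import math

def toy_backprop(x, y, beta, omega):
    # forward pass: compute and store the intermediate quantities (equation 7.8)
    f0 = beta[0] + omega[0] * x;  h1 = math.sin(f0)
    f1 = beta[1] + omega[1] * h1; h2 = math.exp(f1)
    f2 = beta[2] + omega[2] * h2; h3 = math.cos(f2)
    f3 = beta[3] + omega[3] * h3
    # backward pass #1: derivatives in reverse order (equations 7.10-7.12)
    dl_df3 = 2.0 * (f3 - y)
    dl_df2 = -math.sin(f2) * omega[3] * dl_df3   # dh3/df2 = -sin[f2], df3/dh3 = omega_3
    dl_df1 = math.exp(f1) * omega[2] * dl_df2    # dh2/df1 = exp[f1],  df2/dh2 = omega_2
    dl_df0 = math.cos(f0) * omega[1] * dl_df1    # dh1/df0 = cos[f0],  df1/dh1 = omega_1
    # backward pass #2: derivatives w.r.t. the parameters (equations 7.13-7.15)
    dl_dbeta = [dl_df0, dl_df1, dl_df2, dl_df3]
    dl_domega = [x * dl_df0, h1 * dl_df1, h2 * dl_df2, h3 * dl_df3]
    return dl_dbeta, dl_domega

def toy_loss(x, y, beta, omega):
    # equation 7.5 composed with the least squares loss of equation 7.6
    f = beta[3] + omega[3] * math.cos(
        beta[2] + omega[2] * math.exp(beta[1] + omega[1] * math.sin(beta[0] + omega[0] * x)))
    return (f - y) ** 2

beta, omega = [0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]   # arbitrary values
x, y, eps = 2.0, 1.0, 1e-6
_, dl_domega = toy_backprop(x, y, beta, omega)
omega_plus = [omega[0] + eps] + omega[1:]
finite_diff = (toy_loss(x, y, beta, omega_plus) - toy_loss(x, y, beta, omega)) / eps
```

finite_diff and dl_domega[0] agree to several decimal places, even though the backward pass never forms the unwieldy expression of equation 7.7 explicitly.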
7.4 Backpropagation algorithm
Now we repeat this process for a three-layer network (figure 7.1). The intuition and much of the algebra are identical. The main differences are that intermediate variables f_k, h_k are vectors, the biases β_k are vectors, the weights Ω_k are matrices, and we are using ReLU functions rather than simple algebraic functions like cos[•].
Forward pass: We write the network as a series of sequential calculations:

    f_0 = β_0 + Ω_0 x_i
    h_1 = a[f_0]
    f_1 = β_1 + Ω_1 h_1
    h_2 = a[f_1]
    f_2 = β_2 + Ω_2 h_2
    h_3 = a[f_2]
    f_3 = β_3 + Ω_3 h_3
    ℓ_i = l[f_3, y_i],   (7.16)
¹Note that we did not actually need the derivatives ∂ℓ_i/∂h_k of the loss with respect to the activations. In the final backpropagation algorithm, we will not compute these explicitly.
Figure 7.6 Derivative of rectied linear
unit. The rectied linear unit (orange
curve) returns zero when the input is
less than zero and returns the input oth-
erwise. Its derivative (cyan curve) re-
turns zero when the input is less than
zero (since the slope here is zero) and
one when the input is greater than zero
(since the slope here is one).
where f_{k−1} represents the pre-activations at the k-th hidden layer (i.e., the values before the ReLU function a[•]) and h_k contains the activations at the k-th hidden layer (i.e., after the ReLU function). The term l[f_3, y_i] represents the loss function (e.g., least squares or binary cross-entropy loss). In the forward pass, we work through these calculations and store all the intermediate quantities.
Backward pass #1: Now let’s consider how the loss changes when we modify the pre-
activations f
0
, f
1
, f
2
. Applying the chain rule, the expression for the derivative of the
Appendix B.5
Matrix calculus
loss
i
with respect to f
2
is:
i
f
2
=
h
3
f
2
f
3
h
3
i
f
3
. (7.17)
The three terms on the right-hand side have sizes D_3 × D_3, D_3 × D_f, and D_f × 1, respectively, where D_3 is the number of hidden units in the third layer, and D_f is the dimensionality of the model output f_3.
Similarly, we can compute how the loss changes when we change f_1 and f_0:

    ∂ℓ_i/∂f_1 = (∂h_2/∂f_1) (∂f_2/∂h_2) [(∂h_3/∂f_2) (∂f_3/∂h_3) (∂ℓ_i/∂f_3)]   (7.18)

    ∂ℓ_i/∂f_0 = (∂h_1/∂f_0) (∂f_1/∂h_1) [(∂h_2/∂f_1) (∂f_2/∂h_2) (∂h_3/∂f_2) (∂f_3/∂h_3) (∂ℓ_i/∂f_3)].   (7.19)
Note that in each case, the term in brackets was computed in the previous step. By working backward through the network, we can reuse the previous computations (problem 7.3).

Moreover, the terms themselves are simple (problems 7.4–7.5). Working backward through the right-hand side of equation 7.17, we have:
• The derivative ∂ℓ_i/∂f_3 of the loss ℓ_i with respect to the network output f_3 will depend on the loss function but usually has a simple form.

• The derivative ∂f_3/∂h_3 of the network output with respect to hidden layer h_3 is:
    ∂f_3/∂h_3 = ∂(β_3 + Ω_3 h_3)/∂h_3 = Ω_3^T.   (7.20)

  If you are unfamiliar with matrix calculus, this result is not obvious. It is explored in problem 7.6.
• The derivative ∂h_3/∂f_2 of the output h_3 of the activation function with respect to its input f_2 will depend on the activation function. It will be a diagonal matrix since each activation only depends on the corresponding pre-activation. For ReLU functions, the diagonal terms are zero everywhere f_2 is less than zero and one otherwise (figure 7.6; problems 7.7–7.8). Rather than multiply by this matrix, we extract the diagonal terms as a vector I[f_2 > 0] and pointwise multiply, which is more efficient.
The terms on the right-hand side of equations 7.18 and 7.19 have similar forms. As we progress back through the network, we alternately (i) multiply by the transpose of the weight matrices Ω_k^T and (ii) threshold based on the inputs f_{k−1} to the hidden layer. These inputs were stored during the forward pass.
Backward pass #2: Now that we know how to compute ∂ℓ_i/∂f_k, we can focus on calculating the derivatives of the loss with respect to the weights and biases. To calculate the derivatives of the loss with respect to the biases β_k, we again use the chain rule:

    ∂ℓ_i/∂β_k = (∂f_k/∂β_k) (∂ℓ_i/∂f_k)
              = ∂(β_k + Ω_k h_k)/∂β_k · ∂ℓ_i/∂f_k
              = ∂ℓ_i/∂f_k,   (7.21)

which we already calculated in equations 7.17 and 7.18.
Similarly, the derivative for the weight matrix Ω_k is given by:

    ∂ℓ_i/∂Ω_k = (∂f_k/∂Ω_k) (∂ℓ_i/∂f_k)
              = ∂(β_k + Ω_k h_k)/∂Ω_k · ∂ℓ_i/∂f_k
              = (∂ℓ_i/∂f_k) h_k^T.   (7.22)
Again, the progression from line two to line three is not obvious and is explored in problem 7.9. However, the result makes sense. The final line is a matrix of the same size as Ω_k. It depends linearly on h_k, which was multiplied by Ω_k in the original expression. This is also consistent with the initial intuition that the derivative of the weights in Ω_k will be proportional to the values of the hidden units h_k that they multiply. Recall that we already computed these during the forward pass.
7.4.1 Backpropagation algorithm summary
We now briey summarize the nal backpropagation algorithm. Consider a deep neural
network f[x
i
, ϕ] that takes input x
i
, has K hidden layers with ReLU activations, and
individual loss term
i
= l[f[x
i
, ϕ], y
i
]. The goal of backpropagation is to compute the
derivatives
i
/∂β
k
and
i
/∂
k
with respect to the biases β
k
and weights
k
.
Forward pass: We compute and store the following quantities:

    f_0 = β_0 + Ω_0 x_i
    h_k = a[f_{k−1}]        k ∈ {1, 2, . . . , K}
    f_k = β_k + Ω_k h_k.    k ∈ {1, 2, . . . , K}   (7.23)
Backward pass: We start with the derivative ∂ℓ_i/∂f_K of the loss function ℓ_i with respect to the network output f_K and work backward through the network:

    ∂ℓ_i/∂β_k = ∂ℓ_i/∂f_k                               k ∈ {K, K−1, . . . , 1}
    ∂ℓ_i/∂Ω_k = (∂ℓ_i/∂f_k) h_k^T                       k ∈ {K, K−1, . . . , 1}
    ∂ℓ_i/∂f_{k−1} = I[f_{k−1} > 0] ⊙ (Ω_k^T ∂ℓ_i/∂f_k),  k ∈ {K, K−1, . . . , 1}   (7.24)

where ⊙ denotes pointwise multiplication, and I[f_{k−1} > 0] is a vector containing ones where f_{k−1} is greater than zero and zeros elsewhere. Finally, we compute the derivatives with respect to the first set of biases and weights:

    ∂ℓ_i/∂β_0 = ∂ℓ_i/∂f_0
    ∂ℓ_i/∂Ω_0 = (∂ℓ_i/∂f_0) x_i^T.   (7.25)
We calculate these derivatives for every training example in the batch and sum them together to retrieve the gradient for the SGD update (problem 7.10; notebook 7.2).

Note that the backpropagation algorithm is extremely efficient; the most demanding computational step in both the forward and backward pass is matrix multiplication (by Ω and Ω^T, respectively), which only requires additions and multiplications. However, it is not memory efficient; the intermediate values in the forward pass must all be stored, and this can limit the size of the model we can train.
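The summary in equations 7.23–7.25 can be transcribed almost line-for-line into NumPy. The sketch below is our own illustration (the network sizes, random parameters, and the least squares loss summed over output elements are all chosen arbitrarily); the last lines check one derivative against a finite difference.

```python
import numpy as np

def forward(x, betas, Omegas):
    # equation 7.23: compute and store pre-activations f_k and activations h_k
    K = len(Omegas) - 1
    fs = [betas[0] + Omegas[0] @ x]               # f_0
    hs = [None]                                   # placeholder; h_k defined for k >= 1
    for k in range(1, K + 1):
        hs.append(np.maximum(fs[k - 1], 0.0))     # h_k = a[f_{k-1}] (ReLU)
        fs.append(betas[k] + Omegas[k] @ hs[k])   # f_k = beta_k + Omega_k h_k
    return fs, hs

def backward(x, y, betas, Omegas):
    # equations 7.24-7.25 for the least squares loss l_i = sum((f_K - y)^2)
    K = len(Omegas) - 1
    fs, hs = forward(x, betas, Omegas)
    dl_df = 2.0 * (fs[K] - y)                     # dl/df_K
    d_betas, d_Omegas = [None] * (K + 1), [None] * (K + 1)
    for k in range(K, 0, -1):
        d_betas[k] = dl_df                        # dl/dbeta_k = dl/df_k
        d_Omegas[k] = np.outer(dl_df, hs[k])      # dl/dOmega_k = (dl/df_k) h_k^T
        dl_df = (fs[k - 1] > 0) * (Omegas[k].T @ dl_df)   # threshold, then Omega_k^T
    d_betas[0], d_Omegas[0] = dl_df, np.outer(dl_df, x)
    return d_betas, d_Omegas

rng = np.random.default_rng(0)
sizes = [4, 5, 3, 2]                              # input, two hidden layers, output
betas = [rng.standard_normal(sizes[k + 1]) for k in range(3)]
Omegas = [rng.standard_normal((sizes[k + 1], sizes[k])) for k in range(3)]
x, y = rng.standard_normal(4), rng.standard_normal(2)

d_betas, d_Omegas = backward(x, y, betas, Omegas)

# finite-difference check on a single weight element
eps = 1e-6
loss = lambda Os: np.sum((forward(x, betas, Os)[0][-1] - y) ** 2)
O_pert = [O.copy() for O in Omegas]
O_pert[1][0, 0] += eps
finite_diff = (loss(O_pert) - loss(Omegas)) / eps
```

finite_diff matches d_Omegas[1][0, 0] closely; in practice, this kind of gradient check is how hand-written backpropagation code is debugged.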
7.4.2 Algorithmic dierentiation
Although it’s important to understand the backpropagation algorithm, it’s unlikely that
you will need to code it in practice. Modern deep learning frameworks such as PyTorch
and TensorFlow calculate the derivatives automatically, given the model specification. This is known as algorithmic differentiation.
Each functional component (linear transform, ReLU activation, loss function) in the framework knows how to compute its own derivative. For example, the PyTorch ReLU function z_out = relu[z_in] knows how to compute the derivative of its output z_out with respect to its input z_in. Similarly, a linear function z_out = β + Ωz_in knows how to compute the derivatives of the output z_out with respect to the input z_in and with respect to the parameters β and Ω. The algorithmic differentiation framework also knows the sequence of operations in the network and thus has all the information required to perform the forward and backward passes.
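As a concrete illustration, here is the toy model of section 7.3 written with PyTorch tensors (the same arbitrary parameter values as before). We never code the backward pass; the framework records the sequence of operations, and a single call to backward() fills in all the derivatives.

```python
import torch

beta = torch.tensor([0.1, 0.2, 0.3, 0.4], requires_grad=True)
omega = torch.tensor([0.5, 0.6, 0.7, 0.8], requires_grad=True)
x, y = torch.tensor(2.0), torch.tensor(1.0)

# forward pass: equation 7.5 built from differentiable operations
f = beta[3] + omega[3] * torch.cos(
        beta[2] + omega[2] * torch.exp(
            beta[1] + omega[1] * torch.sin(beta[0] + omega[0] * x)))
loss = (f - y) ** 2

loss.backward()          # reverse-mode algorithmic differentiation
print(beta.grad)         # dl/dbeta_k for all k
print(omega.grad)        # dl/domega_k for all k
```

The gradients agree with the hand-derived backward pass of section 7.3; the framework has simply automated the bookkeeping.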
These frameworks exploit the massive parallelism of modern graphics processing units (GPUs). Computations such as matrix multiplication (which features in both the forward and backward pass) are naturally amenable to parallelization. Moreover, it's possible to perform the forward and backward passes for the entire batch in parallel if the model and intermediate results in the forward pass do not exceed the available memory (problem 7.11).
Since the training algorithm now processes the entire batch in parallel, the input
becomes a multi-dimensional tensor. In this context, a tensor can be considered the
generalization of a matrix to arbitrary dimensions. Hence, a vector is a 1D tensor, a
matrix is a 2D tensor, and a 3D tensor is a 3D grid of numbers. Until now, the training
data have been 1D, so the input for backpropagation would be a 2D tensor where the
first dimension indexes the batch element and the second indexes the data dimension.
In subsequent chapters, we will encounter more complex structured input data. For
example, in models where the input is an RGB image, the original data examples are
3D (height × width × channel). Here, the input to the learning framework would be a
4D tensor, where the extra dimension indexes the batch element.
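The shapes described above can be checked directly. This small sketch uses illustrative sizes of our own choosing; note that PyTorch conventionally stores image batches channel-first (batch × channels × height × width).

```python
import torch

batch_1d = torch.zeros(10, 5)            # 2D tensor: 10 batch elements, 5 data dimensions
batch_rgb = torch.zeros(10, 3, 64, 64)   # 4D tensor: batch of 10 RGB images, 64x64 pixels

print(batch_1d.ndim, batch_rgb.ndim)
```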
7.4.3 Extension to arbitrary computational graphs
We have described backpropagation in a deep neural network that is naturally sequential; we calculate the intermediate quantities f_0, h_1, f_1, h_2, . . . , f_K in turn. However, models need not be restricted to sequential computation. Later in this book, we will meet models with branching structures. For example, we might take the values in a hidden layer and process them through two different sub-networks before recombining (problems 7.12–7.13).
Fortunately, the ideas of backpropagation still hold if the computational graph is acyclic. Modern algorithmic differentiation frameworks such as PyTorch and TensorFlow can handle arbitrary acyclic computational graphs.
7.5 Parameter initialization
The backpropagation algorithm computes the derivatives that are used by stochastic gradient descent and Adam to train the model. We now address how to initialize the parameters before we start training. To see why this is crucial, consider that during the forward pass, each set of pre-activations f_k is computed as:
    f_k = β_k + Ω_k h_k
        = β_k + Ω_k a[f_{k−1}],   (7.26)

where a[•] applies the ReLU functions and Ω_k and β_k are the weights and biases, respectively. Imagine that we initialize all the biases to zero and the elements of Ω_k according to a normal distribution with mean zero and variance σ²_Ω. Consider two scenarios:
• If the variance σ²_Ω is very small (e.g., 10⁻⁵), then each element of β_k + Ω_k h_k will be a weighted sum of h_k where the weights are very small; the result will likely have a smaller magnitude than the input. In addition, the ReLU function clips values less than zero, so the range of h_k will be half that of f_{k−1}. Consequently, the magnitudes of the pre-activations at the hidden layers will get smaller and smaller as we progress through the network.
• If the variance σ²_Ω is very large (e.g., 10⁵), then each element of β_k + Ω_k h_k will be a weighted sum of h_k where the weights are very large; the result is likely to have a much larger magnitude than the input. The ReLU function halves the range of the inputs, but if σ²_Ω is large enough, the magnitudes of the pre-activations will still get larger as we progress through the network.
In these two situations, the values at the pre-activations can become so small or so large that they cannot be represented with finite precision floating point arithmetic.

Even if the forward pass is tractable, the same logic applies to the backward pass. Each gradient update (equation 7.24) consists of multiplying by Ω^T. If the values of Ω are not initialized sensibly, then the gradient magnitudes may decrease or increase uncontrollably during the backward pass. These cases are known as the vanishing gradient problem and the exploding gradient problem, respectively. In the former case, updates to the model become vanishingly small. In the latter case, they become unstable.
7.5.1 Initialization for forward pass
We now present a mathematical version of the same argument. Consider the computation between adjacent pre-activations f and f′ with dimensions D_h and D_h′, respectively:

    h = a[f]
    f′ = β + Ωh,   (7.27)

where f represents the pre-activations, Ω and β represent the weights and biases, and a[•] is the activation function.
Assume the pre-activations f_j in the input layer f have variance σ²_f. Consider initializing the biases β_i to zero and the weights Ω_ij as normally distributed with mean zero and variance σ²_Ω. Now we derive expressions for the mean and variance of the pre-activations f′_i in the subsequent layer.
The expectation (mean) E[f′_i] of the intermediate values f′_i is (see appendix C.2):

    E[f′_i] = E[β_i + Σ_{j=1}^{D_h} Ω_ij h_j]
            = E[β_i] + Σ_{j=1}^{D_h} E[Ω_ij h_j]
            = E[β_i] + Σ_{j=1}^{D_h} E[Ω_ij] E[h_j]
            = 0 + Σ_{j=1}^{D_h} 0 · E[h_j] = 0,   (7.28)
where D_h is the dimensionality of the input layer h. We have used the rules for manipulating expectations (appendix C.2.1), and we have assumed that the distributions over the hidden units h_j and the network weights Ω_ij are independent between the second and third lines.
Using this result, we see that the variance σ²_f′ of the pre-activations f′_i is:

    σ²_f′ = E[f′_i²] − E[f′_i]²
          = E[(β_i + Σ_{j=1}^{D_h} Ω_ij h_j)²] − 0
          = E[(Σ_{j=1}^{D_h} Ω_ij h_j)²]
          = Σ_{j=1}^{D_h} E[Ω_ij²] E[h_j²]
          = Σ_{j=1}^{D_h} σ²_Ω E[h_j²] = σ²_Ω Σ_{j=1}^{D_h} E[h_j²],   (7.29)
where we have used the variance identity σ² = E[(z − E[z])²] = E[z²] − E[z]² (appendix C.2.3). We have assumed once more that the distributions of the weights Ω_ij and the hidden units h_j are independent between lines three and four.
Assuming that the input distribution of pre-activations f_j is symmetric about zero, half of these pre-activations will be clipped by the ReLU function, and the second moment E[h_j²] will be half the variance σ²_f of f_j (see problem 7.14):

    σ²_f′ = σ²_Ω Σ_{j=1}^{D_h} σ²_f / 2 = (1/2) D_h σ²_Ω σ²_f.   (7.30)
Figure 7.7 Weight initialization. Consider a deep network with 50 hidden layers and D_h = 100 hidden units per layer. The network has a 100-dimensional input x initialized from a standard normal distribution, a single fixed target y = 0, and a least squares loss function. The bias vectors β_k are initialized to zero, and the weight matrices Ω_k are initialized with a normal distribution with mean zero and five different variances σ²_Ω ∈ {0.001, 0.01, 0.02, 0.1, 1.0}. a) Variance of hidden unit activations computed in forward pass as a function of the network layer. For He initialization (σ²_Ω = 2/D_h = 0.02), the variance is stable. However, for larger values, it increases rapidly, and for smaller values, it decreases rapidly (note log scale). b) The variance of the gradients in the backward pass (solid lines) continues this trend; if we initialize with a value larger than 0.02, the magnitude of the gradients increases rapidly as we pass back through the network. If we initialize with a smaller value, then the magnitude decreases. These are known as the exploding gradient and vanishing gradient problems, respectively.
This, in turn, implies that if we want the variance σ²_f′ of the subsequent pre-activations f′ to be the same as the variance σ²_f of the original pre-activations f during the forward pass, we should set:

    σ²_Ω = 2 / D_h,   (7.31)

where D_h is the dimension of the original layer to which the weights were applied. This is known as He initialization.
7.5.2 Initialization for backward pass
A similar argument establishes how the variance of the gradients ∂l/∂f_k changes during the backward pass. During the backward pass, we multiply by the transpose Ω^T of the weight matrix (equation 7.24), so the equivalent expression becomes:
    σ²_Ω = 2 / D_h′,   (7.32)

where D_h′ is the dimension of the layer that the weights feed into.
7.5.3 Initialization for both forward and backward pass
If the weight matrix Ω is not square (i.e., there are different numbers of hidden units in the two adjacent layers, so D_h and D_h′ differ), then it is not possible to choose the variance to satisfy both equations 7.31 and 7.32 simultaneously. One possible compromise is to use the mean (D_h + D_h′)/2 as a proxy for the number of terms, which gives:

    σ²_Ω = 4 / (D_h + D_h′).   (7.33)
Figure 7.7 shows empirically that both the variance of the hidden units in the forward pass and the variance of the gradients in the backward pass remain stable when the parameters are initialized appropriately (problem 7.15; notebook 7.3).
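A miniature version of the forward-pass experiment in figure 7.7a is easy to reproduce. This sketch is our own (the layer count, seed, and particular variances are illustrative); it propagates a random input through a deep ReLU network with zero biases and records the variance of the pre-activations at each layer.

```python
import numpy as np

def preactivation_variances(sigma2_Omega, D_h=100, layers=50, seed=1):
    # forward pass through a deep ReLU network with zero biases (equation 7.26)
    rng = np.random.default_rng(seed)
    f = rng.standard_normal(D_h)
    variances = []
    for _ in range(layers):
        h = np.maximum(f, 0.0)                                    # ReLU activations
        Omega = rng.normal(0.0, np.sqrt(sigma2_Omega), (D_h, D_h))
        f = Omega @ h                                             # beta_k = 0
        variances.append(f.var())
    return variances

D_h = 100
v_he = preactivation_variances(2.0 / D_h)     # He initialization (sigma^2 = 0.02)
v_small = preactivation_variances(0.001)      # variance shrinks layer by layer
v_large = preactivation_variances(0.1)        # variance grows layer by layer
```

Each layer multiplies the variance by roughly (1/2) D_h σ²_Ω (equation 7.30): with He initialization this factor is one and the variance stays stable, while the other two settings shrink or grow it geometrically.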
7.6 Example training code
The primary focus of this book is scientic; this is not a guide for implementing deep
learning models. Nonetheless, in gure 7.8, we present PyTorch code that implements
the ideas explored in this book so far. The code denes a neural network and initializes
Problems 7.16–7.17
the weights. It creates random input and output datasets and denes a least squares loss
function. The model is trained from the data using SGD with momentum in batches of
size 10 over 100 epochs. The learning rate starts at 0.01 and halves every 10 epochs.
The takeaway is that although the underlying ideas in deep learning are quite com-
plex, implementation is relatively simple. For example, all of the details of the back-
propagation are hidden in the single line of code: loss.backward().
7.7 Summary
The previous chapter introduced stochastic gradient descent (SGD), an iterative optimization algorithm that aims to find the minimum of a function. In the context of neural networks, this algorithm finds the parameters that minimize the loss function. SGD relies on the gradient of the loss function with respect to the parameters, and these parameters must be initialized before optimization begins. This chapter has addressed these two problems for deep neural networks.

The gradients must be evaluated for a very large number of parameters, for each member of the batch, and at each SGD iteration. It is hence imperative that the gradient
import torch, torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader
from torch.optim.lr_scheduler import StepLR

# define input size, hidden layer size, output size
D_i, D_k, D_o = 10, 40, 5

# create model with two hidden layers
model = nn.Sequential(
    nn.Linear(D_i, D_k),
    nn.ReLU(),
    nn.Linear(D_k, D_k),
    nn.ReLU(),
    nn.Linear(D_k, D_o))

# He initialization of weights
def weights_init(layer_in):
    if isinstance(layer_in, nn.Linear):
        nn.init.kaiming_normal_(layer_in.weight)
        layer_in.bias.data.fill_(0.0)

model.apply(weights_init)

# choose least squares loss function
criterion = nn.MSELoss()
# construct SGD optimizer and initialize learning rate and momentum
optimizer = torch.optim.SGD(model.parameters(), lr = 0.1, momentum=0.9)
# object that decreases learning rate by half every 10 epochs
scheduler = StepLR(optimizer, step_size=10, gamma=0.5)

# create 100 random data points and store in data loader class
x = torch.randn(100, D_i)
y = torch.randn(100, D_o)
data_loader = DataLoader(TensorDataset(x,y), batch_size=10, shuffle=True)

# loop over the dataset 100 times
for epoch in range(100):
    epoch_loss = 0.0
    # loop over batches
    for i, data in enumerate(data_loader):
        # retrieve inputs and labels for this batch
        x_batch, y_batch = data
        # zero the parameter gradients
        optimizer.zero_grad()
        # forward pass
        pred = model(x_batch)
        loss = criterion(pred, y_batch)
        # backward pass
        loss.backward()
        # SGD update
        optimizer.step()
        # update statistics
        epoch_loss += loss.item()
    # print error
    print(f'Epoch {epoch:5d}, loss {epoch_loss:.3f}')
    # tell scheduler to consider updating learning rate
    scheduler.step()
Figure 7.8 Sample code for training two-layer network on random data.
computation is ecient, and to this end, the backpropagation algorithm was introduced.
Careful parameter initialization is also critical. The magnitudes of the hidden unit
activations can either decrease or increase exponentially in the forward pass. The same
is true of the gradient magnitudes in the backward pass, where these behaviors are known
as the vanishing gradient and exploding gradient problems. Both impede training but
can be avoided with appropriate initialization.
We’ve now dened the model and the loss function, and we can train a model for a
given task. The next chapter discusses how to measure the model performance.
Notes
Backpropagation: Ecient reuse of partial computations while calculating gradients in com-
putational graphs has been repeatedly discovered, including by Werbos (1974), Bryson et al.
(1979), LeCun (1985), and Parker (1985). However, the most celebrated description of this
idea was by Rumelhart et al. (1985) and Rumelhart et al. (1986), who also coined the term
“backpropagation. This latter work kick-started a new phase of neural network research in the
eighties and nineties; for the rst time, it was practical to train networks with hidden layers.
However, progress stalled due (in retrospect) to a lack of training data, limited computational
power, and the use of sigmoid activations. Areas such as natural language processing and com-
puter vision did not rely on neural network models until the remarkable image classication
results of Krizhevsky et al. (2012) ushered in the modern era of deep learning.
The implementation of backpropagation in modern deep learning frameworks such as PyTorch and TensorFlow is an example of reverse-mode algorithmic differentiation. This is distinguished from forward-mode algorithmic differentiation, in which the derivatives from the chain rule are accumulated while moving forward through the computational graph (see problem 7.13). Further information about algorithmic differentiation can be found in Griewank & Walther (2008) and Baydin et al. (2018).
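The distinction between the two modes can be made concrete with a toy implementation. Forward-mode differentiation can be implemented with dual numbers that carry a (value, derivative) pair through the computation; the class and helper names below are illustrative, and a real framework would support many more operations:

```python
import math

class Dual:
    """Forward-mode AD: a (value, derivative) pair propagated through the graph."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)
    def __mul__(self, other):  # product rule
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

def d_exp(d):  # chain rule for exp
    return Dual(math.exp(d.val), math.exp(d.val) * d.dot)

def d_sin(d):  # chain rule for sin
    return Dual(math.sin(d.val), math.cos(d.val) * d.dot)

# y = exp(x)*sin(x) + x^2, differentiated at x = 0.5 in a single forward pass
x = Dual(0.5, 1.0)  # seed dx/dx = 1
y = d_exp(x) * d_sin(x) + x * x
analytic = math.exp(0.5) * (math.sin(0.5) + math.cos(0.5)) + 2 * 0.5
```

Reverse-mode differentiation (backpropagation) instead seeds ∂y/∂y = 1 and sweeps the graph backward, which is why a single backward pass yields the derivative with respect to every parameter at once.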
Initialization: He initialization was first introduced by He et al. (2015). It follows closely from Glorot or Xavier initialization (Glorot & Bengio, 2010), which is very similar but does not consider the effect of the ReLU layer and so differs by a factor of two. Essentially the same method was proposed much earlier by LeCun et al. (2012) but with a slightly different motivation; in this case, sigmoidal activation functions were used, which naturally normalize the range of outputs at each layer and hence help prevent an exponential increase in the magnitudes of the hidden units. However, if the pre-activations are too large, they fall into the flat regions of the sigmoid function and result in very small gradients. Hence, it is still important to initialize the weights sensibly. Klambauer et al. (2017) introduce the scaled exponential linear unit (SeLU) and show that, within a certain range of inputs, this activation function tends to make the activations in network layers automatically converge to mean zero and unit variance.

A completely different approach is to pass data through the network and then normalize by the empirically observed variance. Layer-sequential unit variance initialization (Mishkin & Matas, 2016) is an example of this kind of method, in which the weight matrices are initialized as orthonormal. GradInit (Zhu et al., 2021) randomizes the initial weights and temporarily fixes them while it learns non-negative scaling factors for each weight matrix. These factors are selected to maximize the decrease in the loss for a fixed learning rate, subject to a constraint on the maximum gradient norm. Activation normalization or ActNorm adds a learnable scaling and offset parameter after each network layer at each hidden unit. An initial batch is run through the network, and the offset and scale are then chosen so that the mean of the activations is zero and the variance one. After this, these extra parameters are learned as part of the model.
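As a concrete sketch (illustrative code, not from the book), He initialization for a fully connected layer draws each weight from a normal distribution with variance 2/fan_in; Glorot initialization would use a variance without the factor of two:

```python
import math
import random

def he_initialize(fan_out, fan_in, seed=0):
    """Draw a fan_out x fan_in weight matrix from N(0, 2 / fan_in).
    The factor of two compensates for the ReLU zeroing half the pre-activations."""
    rng = random.Random(seed)
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]

W = he_initialize(200, 400)
# The empirical weight variance should be close to 2/400 = 0.005.
empirical_var = sum(w * w for row in W for w in row) / (200 * 400)
```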
Draft: please send errata to udlbookmail@gmail.com.
Closely related to these methods are schemes such as BatchNorm (Ioffe & Szegedy, 2015), in which the network normalizes the variance of each batch as part of its processing at every step. BatchNorm and its variants are discussed in chapter 11. Other initialization schemes have been proposed for specific architectures, including the ConvolutionOrthogonal initializer (Xiao et al., 2018a) for convolutional networks, Fixup (Zhang et al., 2019a) for residual networks, and TFixup (Huang et al., 2020a) and DTFixup (Xu et al., 2021b) for transformers.
Reducing memory requirements: Training neural networks is memory intensive. We must store both the model parameters and the pre-activations at the hidden units for every member of the batch during the forward pass. Two methods that decrease memory requirements are gradient checkpointing (Chen et al., 2016a) and micro-batching (Huang et al., 2019). In gradient checkpointing, the activations are only stored every N layers during the forward pass. During the backward pass, the missing intermediate activations are recalculated from the nearest checkpoint. In this manner, we can drastically reduce the memory requirements at the computational cost of performing the forward pass twice (problem 7.11). In micro-batching, the batch is subdivided into smaller parts, and the gradient updates are aggregated from each sub-batch before being applied to the network. A completely different approach is to build a reversible network (e.g., Gomez et al., 2017), in which the activations at the previous layer can be computed from the activations at the current one, so there is no need to cache anything during the forward pass (see chapter 16). Sohoni et al. (2019) review approaches to reducing memory requirements.
Distributed training: For sufficiently large models, the memory requirements or total required time may be too much for a single processor. In this case, we must use distributed training, in which training takes place in parallel across multiple processors. There are several approaches to parallelism. In data parallelism, each processor or node contains a full copy of the model but runs a subset of the batch (see Xing et al., 2015; Li et al., 2020b). The gradients from each node are aggregated centrally and then redistributed back to each node to ensure that the models remain consistent. This is known as synchronous training. The synchronization required to aggregate and redistribute the gradients can be a performance bottleneck, and this leads to the idea of asynchronous training. For example, in the Hogwild! algorithm (Recht et al., 2011), the gradient from a node is used to update a central model whenever it is ready. The updated model is then redistributed to the node. This means that each node may have a slightly different version of the model at any given time, so the gradient updates may be stale; however, it works well in practice. Other decentralized schemes have also been developed. For example, in Zhang et al. (2016a), the individual nodes update one another in a ring structure.

Data parallelism methods still assume that the entire model can be held in the memory of a single node. Pipeline model parallelism stores different layers of the network on different nodes and hence does not have this requirement. In a naïve implementation, the first node runs the forward pass for the batch on the first few layers and passes the result to the next node, which runs the forward pass on the next few layers, and so on. In the backward pass, the gradients are updated in the opposite order. The obvious disadvantage of this approach is that each machine lies idle for most of the cycle. Various schemes revolving around each node processing micro-batches sequentially have been proposed to reduce this inefficiency (e.g., Huang et al., 2019; Narayanan et al., 2021a). Finally, in tensor model parallelism, computation at a single network layer is distributed across nodes (e.g., Shoeybi et al., 2019). A good overview of distributed training methods can be found in Narayanan et al. (2021b), who combine tensor, pipeline, and data parallelism to train a language model with one trillion parameters on 3072 GPUs.
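Synchronous data parallelism is easy to simulate on a single machine (illustrative sketch, not from the book): each "node" computes the gradient on its shard of the batch, the shard gradients are averaged centrally, and every node applies the same update, so the model copies stay identical:

```python
# Fit y = w*x by least squares, with the batch split across two simulated nodes.
def shard_gradient(w, shard):
    """Gradient of the mean squared error over one node's shard of the batch."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)

data = [(0.1 * i, 0.3 * i) for i in range(1, 21)]   # targets follow y = 3x
shards = [data[:10], data[10:]]                     # equal-sized shards, one per node

w = 0.0
for _ in range(200):
    grads = [shard_gradient(w, s) for s in shards]  # computed "in parallel"
    g = sum(grads) / len(grads)                     # central aggregation
    w -= 0.1 * g                                    # identical update on every node
```

Because the shards are equal-sized, the average of the shard gradients equals the full-batch gradient, which is exactly the consistency property synchronous training maintains.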
Problems
Problem 7.1 A two-layer network with two hidden units in each layer can be defined as:

$$\begin{aligned}
y = \phi_0 &+ \phi_1 a\Bigl[\psi_{01} + \psi_{11}\, a[\theta_{01} + \theta_{11}x] + \psi_{21}\, a[\theta_{02} + \theta_{12}x]\Bigr] \\
&+ \phi_2 a\Bigl[\psi_{02} + \psi_{12}\, a[\theta_{01} + \theta_{11}x] + \psi_{22}\, a[\theta_{02} + \theta_{12}x]\Bigr],
\end{aligned} \tag{7.34}$$

where the functions $a[\bullet]$ are ReLU functions. Compute the derivatives of the output $y$ with respect to each of the 13 parameters $\phi_\bullet$, $\theta_{\bullet\bullet}$, and $\psi_{\bullet\bullet}$ directly (i.e., not using the backpropagation algorithm). The derivative of the ReLU function with respect to its input, $\partial a[z]/\partial z$, is the indicator function $\mathbb{I}[z > 0]$, which returns one if the argument is greater than zero and zero otherwise (figure 7.6).
Problem 7.2 Find an expression for the final term in each of the five chains of derivatives in equation 7.12.
Problem 7.3 What size are each of the terms in equation 7.19?
Problem 7.4 Calculate the derivative $\partial \ell_i / \partial f[x_i, \phi]$ for the least squares loss function:

$$\ell_i = \bigl(y_i - f[x_i, \phi]\bigr)^2. \tag{7.35}$$
Problem 7.5 Calculate the derivative $\partial \ell_i / \partial f[x_i, \phi]$ for the binary classification loss function:

$$\ell_i = -(1 - y_i)\log\Bigl[1 - \text{sig}\bigl[f[x_i, \phi]\bigr]\Bigr] - y_i \log\Bigl[\text{sig}\bigl[f[x_i, \phi]\bigr]\Bigr], \tag{7.36}$$

where the function $\text{sig}[\bullet]$ is the logistic sigmoid and is defined as:

$$\text{sig}[z] = \frac{1}{1 + \exp[-z]}. \tag{7.37}$$
Problem 7.6 Show that for $\mathbf{z} = \boldsymbol\beta + \boldsymbol\Omega \mathbf{h}$:

$$\frac{\partial \mathbf{z}}{\partial \mathbf{h}} = \boldsymbol\Omega^{T},$$

where $\partial \mathbf{z}/\partial \mathbf{h}$ is a matrix containing the term $\partial z_i / \partial h_j$ in its $i$th column and $j$th row. To do this, first find an expression for the constituent elements $\partial z_i / \partial h_j$, and then consider the form that the matrix $\partial \mathbf{z}/\partial \mathbf{h}$ must take.
Problem 7.7 Consider the case where we use the logistic sigmoid (see equation 7.37) as an activation function, so $h = \text{sig}[f]$. Compute the derivative $\partial h / \partial f$ for this activation function. What happens to the derivative when the input takes (i) a large positive value and (ii) a large negative value?
Problem 7.8 Consider using (i) the Heaviside function and (ii) the rectangular function as activation functions:

$$\text{Heaviside}[z] = \begin{cases} 0 & z < 0 \\ 1 & z \ge 0 \end{cases}, \tag{7.38}$$

and

$$\text{rect}[z] = \begin{cases} 0 & z < 0 \\ 1 & 0 \le z \le 1 \\ 0 & z > 1 \end{cases}. \tag{7.39}$$

Discuss why these functions are problematic for neural network training with gradient-based optimization methods.

Figure 7.9 Computational graph for problem 7.12 and problem 7.13. Adapted from Domke (2010).
Problem 7.9 Consider a loss function $\ell[\mathbf{f}]$, where $\mathbf{f} = \boldsymbol\beta + \boldsymbol\Omega \mathbf{h}$. We want to find how the loss $\ell$ changes when we change $\boldsymbol\Omega$, which we’ll express with a matrix that contains the derivative $\partial \ell / \partial \Omega_{ij}$ at the $i$th row and $j$th column. Find an expression for $\partial f_i / \partial \Omega_{ij}$ and, using the chain rule, show that:

$$\frac{\partial \ell}{\partial \boldsymbol\Omega} = \frac{\partial \ell}{\partial \mathbf{f}} \mathbf{h}^{T}. \tag{7.40}$$
Problem 7.10 Derive the equations for the backward pass of the backpropagation algorithm for a network that uses leaky ReLU activations, which are defined as:

$$a[z] = \text{LReLU}[z] = \begin{cases} \alpha \cdot z & z < 0 \\ z & z \ge 0 \end{cases}, \tag{7.41}$$

where $\alpha$ is a small positive constant (typically 0.1).
Problem 7.11 Consider training a network with fifty layers, where we only have enough memory to store the pre-activations at every tenth hidden layer during the forward pass. Explain how to compute the derivatives in this situation using gradient checkpointing.
Problem 7.12 This problem explores computing derivatives on general acyclic computational graphs. Consider the function:

$$y = \exp\Bigl[\exp[x] + \exp[x]^2\Bigr] + \sin\Bigl[\exp[x] + \exp[x]^2\Bigr]. \tag{7.42}$$

We can break this down into a series of intermediate computations so that:

$$\begin{aligned}
f_1 &= \exp[x] \\
f_2 &= f_1^2 \\
f_3 &= f_1 + f_2 \\
f_4 &= \exp[f_3] \\
f_5 &= \sin[f_3] \\
y &= f_4 + f_5.
\end{aligned} \tag{7.43}$$

The associated computational graph is depicted in figure 7.9. Compute the derivative $\partial y / \partial x$ by reverse-mode differentiation. In other words, compute in order:

$$\frac{\partial y}{\partial f_5}, \quad \frac{\partial y}{\partial f_4}, \quad \frac{\partial y}{\partial f_3}, \quad \frac{\partial y}{\partial f_2}, \quad \frac{\partial y}{\partial f_1}, \quad \text{and} \quad \frac{\partial y}{\partial x}, \tag{7.44}$$

using the chain rule in each case to make use of the derivatives already computed.
Problem 7.13 For the same function as in problem 7.12, compute the derivative $\partial y / \partial x$ by forward-mode differentiation. In other words, compute in order:

$$\frac{\partial f_1}{\partial x}, \quad \frac{\partial f_2}{\partial x}, \quad \frac{\partial f_3}{\partial x}, \quad \frac{\partial f_4}{\partial x}, \quad \frac{\partial f_5}{\partial x}, \quad \text{and} \quad \frac{\partial y}{\partial x}, \tag{7.45}$$

using the chain rule in each case to make use of the derivatives already computed. Why do we not use forward-mode differentiation when we calculate the parameter gradients for deep networks?
Problem 7.14 Consider a random variable $a$ with variance $\text{Var}[a] = \sigma^2$ and a symmetrical distribution around the mean $\mathbb{E}[a] = 0$. Prove that if we pass this variable through the ReLU function:

$$b = \text{ReLU}[a] = \begin{cases} 0 & a < 0 \\ a & a \ge 0 \end{cases}, \tag{7.46}$$

then the second moment of the transformed variable is $\mathbb{E}[b^2] = \sigma^2/2$.
Problem 7.15 What would you expect to happen if we initialized all of the weights and biases
in the network to zero?
Problem 7.16 Implement the code in figure 7.8 in PyTorch and plot the training loss as a function of the number of epochs.
Problem 7.17 Change the code in figure 7.8 to tackle a binary classification problem. You will need to (i) change the targets y so they are binary, (ii) change the network to predict numbers between zero and one, and (iii) change the loss function appropriately.
Chapter 8
Measuring performance
Previous chapters described neural network models, loss functions, and training algorithms. This chapter considers how to measure the performance of the trained models. With sufficient capacity (i.e., number of hidden units), a neural network model will often perform perfectly on the training data. However, this does not necessarily mean it will generalize well to new test data.

We will see that the test errors have three distinct causes and that their relative contributions depend on (i) the inherent uncertainty in the task, (ii) the amount of training data, and (iii) the choice of model. The latter dependency raises the issue of hyperparameter search. We discuss how to select both the model hyperparameters (e.g., the number of hidden layers and the number of hidden units in each) and the learning algorithm hyperparameters (e.g., the learning rate and batch size).
8.1 Training a simple model
We explore model performance using the MNIST-1D dataset (figure 8.1). This consists of ten classes $y \in \{0, 1, \ldots, 9\}$, representing the digits 0–9. The data are derived from 1D templates for each of the digits. Each data example $x$ is created by randomly transforming one of these templates and adding noise. The full training dataset $\{x_i, y_i\}$ consists of $I = 4000$ training examples, each consisting of $D_i = 40$ dimensions representing the horizontal offset at 40 positions. The ten classes are drawn uniformly during data generation, so there are 400 examples of each class.

We use a network with $D_i = 40$ inputs and $D_o = 10$ outputs, which are passed through a softmax function to produce class probabilities (see section 5.5). The network has two hidden layers with $D = 100$ hidden units each. It is trained using stochastic gradient descent with batch size 100 and learning rate 0.1 for 6000 steps (150 epochs) with a multiclass cross-entropy loss (equation 5.24). Figure 8.2 shows that the training error decreases as training proceeds. The training data are classified perfectly after about 4000 steps. The training loss also decreases, eventually approaching zero.
Problem 8.1
Figure 8.1 MNIST-1D. a) Templates for 10 classes $y \in \{0, \ldots, 9\}$, based on digits 0–9. b) Training examples $x$ are created by randomly transforming a template and c) adding noise. d) The horizontal offset of the transformed template is then sampled at 40 vertical positions. Adapted from Greydanus (2020).

Figure 8.2 MNIST-1D results. a) Percent classification error as a function of the training step. The training set errors decrease to zero, but the test errors do not drop below 40%. This model doesn’t generalize well to new test data. b) Loss as a function of the training step. The training loss decreases steadily toward zero. The test loss decreases at first but subsequently increases as the model becomes increasingly confident about its (wrong) predictions.

Figure 8.3 Regression function. Solid black line shows ground truth function. To generate $I$ training examples $\{x_i, y_i\}$, the input space $x \in [0, 1]$ is divided into $I$ equal segments, and one sample $x_i$ is drawn from a uniform distribution within each segment. The corresponding value $y_i$ is created by evaluating the function at $x_i$ and adding Gaussian noise (gray region shows ±2 standard deviations). The test data are generated in the same way.

However, this doesn’t imply that the classifier is perfect; the model might have memorized the training set but be unable to predict new examples. To estimate the true
performance, we need a separate test set of input/output pairs $\{x_i, y_i\}$. To this end, we generate 1000 more examples using the same process. Figure 8.2a also shows the errors for this test data as a function of the training step. These decrease as training proceeds, but only to around 40%. This is better than the chance error rate of 90% but far worse than for the training set; the model has not generalized well to the test data.
The test loss (figure 8.2b) decreases for the first 1500 training steps but then increases again (Notebook 8.1: MNIST-1D performance). At this point, the test error rate is fairly constant; the model makes the same mistakes but with increasing confidence. This decreases the probability of the correct answers and thus increases the negative log-likelihood. This increasing confidence is a side-effect of the softmax function; the pre-softmax activations are driven to increasingly extreme values to make the probability of the training data approach one (see figure 5.10).
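This effect is easy to demonstrate numerically. In the sketch below (illustrative, not from the book), the true class is 0, but the largest logit belongs to class 1; scaling all the logits up leaves the (wrong) prediction unchanged while the negative log-likelihood of the true class grows:

```python
import math

def softmax(logits):
    m = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [1.0, 2.0, 0.0]                  # true class is 0; argmax is class 1
preds, nlls = [], []
for scale in (1.0, 2.0, 4.0):             # increasingly extreme pre-softmax values
    p = softmax([scale * z for z in logits])
    preds.append(max(range(len(p)), key=p.__getitem__))
    nlls.append(-math.log(p[0]))          # negative log-likelihood of true class
```

The prediction (and hence the error rate) is identical at every scale, but the loss increases steadily, which is exactly the divergence between test error and test loss seen in figure 8.2.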
8.2 Sources of error
We now consider the sources of the errors that occur when a model fails to generalize. To make this easier to visualize, we revert to a 1D linear least squares regression problem where we know exactly how the ground truth data were generated. Figure 8.3 shows a quasi-sinusoidal function; both training and test data are generated by sampling input values in the range [0, 1], passing them through this function, and adding Gaussian noise with a fixed variance.
We t a simplied shallow neural net to this data (gure 8.4). The weights and biases
that connect the input layer to the hidden layer are chosen so that the “joints” of the
function are evenly spaced across the interval. If there are D hidden units, then these
joints will be at 0, 1/D, 2/D, . . . , (D 1)/D. This model can represent any piecewise
linear function with D equally sized regions in the range [0, 1]. As well as being easy to
understand, this model also has the advantage that it can be t in closed form without
the need for stochastic optimization algorithms (see problem 8.3). Consequently, we can
Problems 8.2–8.3
guarantee to nd the global minimum of the loss function during training.
Figure 8.4 Simplified neural network with three hidden units. a) The weights and biases between the input and hidden layer are fixed (dashed arrows). b–d) They are chosen so that the hidden unit activations have slope one, and their joints are equally spaced across the interval, with joints at $x = 0$, $x = 1/3$, and $x = 2/3$, respectively. Modifying the remaining parameters $\phi = \{\beta, \omega_1, \omega_2, \omega_3\}$ can create any piecewise linear function over $x \in [0, 1]$ with joints at 1/3 and 2/3. e–g) Three example functions with different values of the parameters $\phi$.
Figure 8.5 Sources of test error. a) Noise. Data generation is noisy, so even if the model exactly replicates the true underlying function (black line), the noise in the test data (gray points) means that some error will remain (gray region represents two standard deviations). b) Bias. Even with the best possible parameters, the three-region model (cyan line) cannot exactly fit the true function (black line). This bias is another source of error (gray regions represent signed error). c) Variance. In practice, we have limited noisy training data (orange points). When we fit the model, we don’t recover the best possible function from panel (b) but a slightly different function (cyan line) that reflects idiosyncrasies of the training data. This provides an additional source of error (gray region represents two standard deviations). Figure 8.6 shows how this region was calculated.
8.2.1 Noise, bias, and variance
There are three possible sources of error, which are known as noise, bias, and variance, respectively (figure 8.5):

Noise The data generation process includes the addition of noise, so there are multiple possible valid outputs $y$ for each input $x$ (figure 8.5a). This source of error is insurmountable for the test data. Note that it does not necessarily limit the training performance; we will likely never see the same input $x$ twice during training, so it is still possible to fit the training data perfectly.

Noise may arise because there is a genuine stochastic element to the data generation process, because some of the data are mislabeled, or because there are further explanatory variables that were not observed. In rare cases, noise may be absent; for example, a network might approximate a function that is deterministic but requires significant computation to evaluate. However, noise is usually a fundamental limitation on the possible test performance.

Bias A second potential source of error may occur because the model is not flexible enough to fit the true function perfectly. For example, the three-region neural network model cannot exactly describe the quasi-sinusoidal function, even when the parameters are chosen optimally (figure 8.5b). This is known as bias.
Variance We have limited training examples, and there is no way to distinguish systematic changes in the underlying function from noise in the underlying data. When we fit a model, we do not get the closest possible approximation to the true underlying function. Indeed, for different training datasets, the result will be slightly different each time. This additional source of variability in the fitted function is termed variance (figure 8.5c). In practice, there might also be additional variance due to the stochastic learning algorithm, which does not necessarily converge to the same solution each time.
8.2.2 Mathematical formulation of test error
We now make the notions of noise, bias, and variance mathematically precise. Consider a 1D regression problem where the data generation process has additive noise with variance $\sigma^2$ (e.g., figure 8.3); we can observe different outputs $y$ for the same input $x$, so for each $x$, there is a distribution $Pr(y|x)$ with expected value (mean) $\mu[x]$ (Appendix C.2: Expectation):

$$\mu[x] = \mathbb{E}_y\bigl[y[x]\bigr] = \int y[x] \, Pr(y|x) \, dy, \tag{8.1}$$

and fixed noise $\sigma^2 = \mathbb{E}_y\bigl[(\mu[x] - y[x])^2\bigr]$. Here, we have used the notation $y[x]$ to specify that we are considering the output $y$ at a given input position $x$.
Now consider a least squares loss between the model prediction $f[x, \phi]$ at position $x$ and the observed value $y[x]$ at that position:

$$\begin{aligned}
L[x] &= \bigl(f[x,\phi] - y[x]\bigr)^2 \\
&= \Bigl(\bigl(f[x,\phi] - \mu[x]\bigr) + \bigl(\mu[x] - y[x]\bigr)\Bigr)^2 \\
&= \bigl(f[x,\phi] - \mu[x]\bigr)^2 + 2\bigl(f[x,\phi] - \mu[x]\bigr)\bigl(\mu[x] - y[x]\bigr) + \bigl(\mu[x] - y[x]\bigr)^2,
\end{aligned} \tag{8.2}$$

where we have both added and subtracted the mean $\mu[x]$ of the underlying function in the second line and have expanded out the squared term in the third line.
The underlying function is stochastic, so this loss depends on the particular $y[x]$ we observe. The expected loss is:

$$\begin{aligned}
\mathbb{E}_y\bigl[L[x]\bigr] &= \mathbb{E}_y\Bigl[\bigl(f[x,\phi]-\mu[x]\bigr)^2 + 2\bigl(f[x,\phi]-\mu[x]\bigr)\bigl(\mu[x]-y[x]\bigr) + \bigl(\mu[x]-y[x]\bigr)^2\Bigr] \\
&= \bigl(f[x,\phi]-\mu[x]\bigr)^2 + 2\bigl(f[x,\phi]-\mu[x]\bigr)\bigl(\mu[x]-\mathbb{E}_y[y[x]]\bigr) + \mathbb{E}_y\Bigl[\bigl(\mu[x]-y[x]\bigr)^2\Bigr] \\
&= \bigl(f[x,\phi]-\mu[x]\bigr)^2 + 2\bigl(f[x,\phi]-\mu[x]\bigr)\cdot 0 + \mathbb{E}_y\Bigl[\bigl(\mu[x]-y[x]\bigr)^2\Bigr] \\
&= \bigl(f[x,\phi]-\mu[x]\bigr)^2 + \sigma^2,
\end{aligned} \tag{8.3}$$

where we have made use of the rules for manipulating expectations (Appendix C.2.1: Expectation rules). In the second line, we have distributed the expectation operator and removed it from terms with no dependence on $y[x]$, and in the third line, we note that the second term is zero since $\mathbb{E}_y[y[x]] = \mu[x]$ by definition. Finally, in the fourth line, we have substituted in the definition of the noise $\sigma^2$. We can see that the expected loss has been broken down into two terms; the first term is the squared deviation between the model and the true function mean, and the second term is the noise.
The rst term can be further partitioned into bias and variance. The parameters ϕ of
the model f[x, ϕ] depend on the training dataset D = {x
i
, y
i
}, so more properly, we should
write f [x, ϕ[D]]. The training dataset is a random sample from the data generation
process; with a dierent sample of training data, we would learn dierent parameter
values. The expected model output f
µ
[x] with respect to all possible datasets D is hence:
f
µ
[x] = E
D
h
f
x, ϕ[D]
i
. (8.4)
Returning to the first term of equation 8.3, we add and subtract $f_\mu[x]$ and expand:

$$\begin{aligned}
\bigl(f[x, \phi[\mathcal{D}]] - \mu[x]\bigr)^2
&= \Bigl(\bigl(f[x, \phi[\mathcal{D}]] - f_\mu[x]\bigr) + \bigl(f_\mu[x] - \mu[x]\bigr)\Bigr)^2 \\
&= \bigl(f[x, \phi[\mathcal{D}]] - f_\mu[x]\bigr)^2 + 2\bigl(f[x, \phi[\mathcal{D}]] - f_\mu[x]\bigr)\bigl(f_\mu[x] - \mu[x]\bigr) + \bigl(f_\mu[x] - \mu[x]\bigr)^2.
\end{aligned} \tag{8.5}$$
We then take the expectation with respect to the training dataset $\mathcal{D}$:

$$\mathbb{E}_{\mathcal{D}}\Bigl[\bigl(f[x, \phi[\mathcal{D}]] - \mu[x]\bigr)^2\Bigr] = \mathbb{E}_{\mathcal{D}}\Bigl[\bigl(f[x, \phi[\mathcal{D}]] - f_\mu[x]\bigr)^2\Bigr] + \bigl(f_\mu[x] - \mu[x]\bigr)^2, \tag{8.6}$$

where we have simplified using similar steps as for equation 8.3. Finally, we substitute this result into equation 8.3:
$$\mathbb{E}_{\mathcal{D}}\Bigl[\mathbb{E}_y\bigl[L[x]\bigr]\Bigr] = \underbrace{\mathbb{E}_{\mathcal{D}}\Bigl[\bigl(f[x, \phi[\mathcal{D}]] - f_\mu[x]\bigr)^2\Bigr]}_{\text{variance}} + \underbrace{\bigl(f_\mu[x] - \mu[x]\bigr)^2}_{\text{bias}} + \underbrace{\sigma^2}_{\text{noise}}. \tag{8.7}$$
This equation says that the expected loss, after considering the uncertainty in the training data $\mathcal{D}$ and the test data $y$, consists of three additive components. The variance is uncertainty in the fitted model due to the particular training dataset we sample. The bias is the systematic deviation of the model from the mean of the function we are modeling. The noise is the inherent uncertainty in the true mapping from input to output. These three sources of error will be present for any task. They combine additively for linear regression with a least squares loss. However, their interaction can be more complex for other types of problems.
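Equation 8.7 can be verified numerically with a toy model (illustrative sketch, not from the book). Here the "model" is simply the best-fit constant, i.e., the mean of the training targets, which makes all three terms easy to estimate by simulating many training sets:

```python
import math
import random

rng = random.Random(1)
SIGMA = 0.3                                     # noise std in the data generator
def mu(x):                                      # stand-in for the true mean function
    return math.sin(2.0 * math.pi * x)

xs_train = [(i + 0.5) / 10 for i in range(10)]  # fixed training inputs
x_test = 0.2

fits, losses = [], []
for _ in range(20_000):                         # many training datasets D
    ys = [mu(x) + rng.gauss(0.0, SIGMA) for x in xs_train]
    c = sum(ys) / len(ys)                       # least-squares constant fit
    fits.append(c)
    y_test = mu(x_test) + rng.gauss(0.0, SIGMA)
    losses.append((c - y_test) ** 2)            # test loss for this dataset

f_mu = sum(fits) / len(fits)                    # E_D[f[x, phi[D]]]
variance = sum((c - f_mu) ** 2 for c in fits) / len(fits)
bias = (f_mu - mu(x_test)) ** 2
noise = SIGMA ** 2
expected_loss = sum(losses) / len(losses)
```

The average test loss over datasets matches the sum of the three terms to within Monte Carlo error, and the variance term behaves as $\sigma^2/I$ for this model, shrinking as the training set grows.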
8.3 Reducing error
In the previous section, we saw that test error results from three sources: noise, bias,
and variance. The noise component is insurmountable; there is nothing we can do to
circumvent this, and it represents a fundamental limit on model performance. However,
it is possible to reduce the other two terms.
8.3.1 Reducing variance
Recall that the variance results from limited noisy training data. Fitting the model to two different training sets results in slightly different parameters. It follows that we can reduce the variance by increasing the quantity of training data. This averages out the inherent noise and ensures that the input space is well sampled.

Figure 8.6 shows the effect of training with 6, 10, and 100 samples. For each dataset size, we show the best-fitting model for three training datasets. With only six samples, the fitted function is quite different each time: the variance is significant. As we increase the number of samples, the fitted models become very similar, and the variance reduces. In general, adding training data almost always improves test performance.
8.3.2 Reducing bias
The bias term results from the inability of the model to describe the true underlying function. This suggests that we can reduce this error by making the model more flexible. This is usually done by increasing the model capacity. For neural networks, this means adding more hidden units and/or hidden layers.

In the simplified model, adding capacity corresponds to adding more hidden units so that the interval [0, 1] is divided into more linear regions. Figures 8.7a–c show that (unsurprisingly) this does indeed reduce the bias; as we increase the number of linear regions to ten, the model becomes flexible enough to fit the true function closely.
8.3.3 Bias-variance trade-off
However, gures 8.7d–f show an unexpected side-eect of increasing the model capac-
ity. For a xed-size training dataset, the variance term increases as the model capacity
increases. Consequently, increasing the model capacity does not necessarily reduce the
test error. This is known as the bias-variance trade-o.
Figure 8.8 explores this phenomenon. In panels a–c), we t the simplied three-region
model to three dierent datasets of fteen points. Although the datasets dier, the nal
model is much the same; the noise in the dataset roughly averages out in each linear
region. In panels d–f), we t a model with ten regions to the same three datasets. This
model has more exibility, but this is disadvantageous; the model certainly ts the data
better, and the training error will be lower, but much of the extra descriptive power is
devoted to modeling the noise. This phenomenon is known as overtting.
We’ve seen that as we add capacity to the model, the bias decreases, but the variance
increases for a xed-size training dataset. This suggests that there is an optimal capacity
where the bias is not too large and the variance is still relatively small. Figure 8.9 shows
how these terms vary numerically for the toy model as we increase the capacity, using
Notebook 8.2
Bias-variance
trade-o
the data from gure 8.8. For regression models, the total expected error is the sum of
the bias and the variance, and this sum is minimized when the model capacity is four
(i.e., with four hidden units and four linear regions in the range of the data).
Figure 8.6 Reducing variance by increasing training data. a–c) The three-region model fitted to three different randomly sampled datasets of six points. The fitted model is quite different each time. d) We repeat this experiment many times and plot the mean model predictions (cyan line) and the variance of the model predictions (gray area shows two standard deviations). e–h) We do the same experiment, but this time with datasets of size ten. The variance of the predictions is reduced. i–l) We repeat this experiment with datasets of size 100. Now the fitted model is always similar, and the variance is small.
Figure 8.7 Bias and variance as a function of model capacity. a–c) As we increase the number of hidden units of the toy model, the number of linear regions increases, and the model becomes able to fit the true function closely; the bias (gray region) decreases. d–f) Unfortunately, increasing the model capacity has the side-effect of increasing the variance term (gray region). This is known as the bias-variance trade-off.
8.4 Double descent
In the previous section, we examined the bias-variance trade-off as we increased the capacity of a model. Let’s now return to the MNIST-1D dataset and see whether this happens in practice. We use 10,000 training examples, test with another 5,000 examples, and examine the training and test performance as we increase the capacity (number of parameters) in the model. We train the model with Adam and a step size of 0.005 using a full batch of 10,000 examples for 4000 steps.

Figure 8.10a shows the training and test error for a neural network with two hidden layers as the number of hidden units increases. The training error decreases as the capacity grows and quickly becomes close to zero. The vertical dashed line represents the capacity where the model has the same number of parameters as there are training examples, but the model memorizes the dataset before this point. The test error decreases as we add model capacity but does not increase as predicted by the bias-variance trade-off curve; it keeps decreasing.
In gure 8.10b, we repeat this experiment, but this time, we randomize 15% of the
Figure 8.8 Overfitting. a–c) A model with three regions is fit to three different datasets of fifteen points each. The result is similar in all three cases (i.e., the variance is low). d–f) A model with ten regions is fit to the same datasets. The additional flexibility does not necessarily produce better predictions. While these three models each describe the training data better, they are not necessarily closer to the true underlying function (black curve). Instead, they overfit the data and describe the noise, and the variance (difference between fitted curves) is larger.
Figure 8.9 Bias-variance trade-off. The bias and variance terms from equation 8.7 are plotted as a function of the model capacity (number of hidden units / linear regions in the range of the data) in the simplified model, using training data from figure 8.8. As the capacity increases, the bias (solid orange line) decreases, but the variance (solid cyan line) increases. The sum of these two terms (dashed gray line) is minimized when the capacity is four.
training labels. Once more, the training error decreases to zero. This time, there is
more randomness, and the model requires almost as many parameters as there are data
points to memorize the data. The test error does show the typical bias-variance trade-off
as we increase the capacity to the point where the model fits the training data exactly.
However, then it does something unexpected; it starts to decrease again. Indeed, if we
add enough capacity, the test loss reduces to below the minimal level that we achieved
in the first part of the curve.
This phenomenon is known as double descent. For some datasets like MNIST, it is
present with the original data (figure 8.10c). For others, like MNIST-1D and CIFAR-100
(figure 8.10d), it emerges or becomes more prominent when we add noise to the labels.
Notebook 8.3: Double descent
The rst part of the curve is referred to as the classical or under-parameterized regime,
and the second part as the modern or over-parameterized regime. The central part where
the error increases is termed the critical regime.
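The over-parameterized regime can be explored cheaply without training a deep network. Below is a minimal sketch (my own construction, not the book's MNIST-1D experiment) using fixed random ReLU features with a minimum-norm least-squares fit; the interpolation threshold sits where the number of features equals the number of training points. Whether a pronounced double-descent peak appears depends on the noise and random seed.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, noise=0.1):
    x = rng.uniform(-1, 1, size=(n, 1))
    y = np.sin(3 * x[:, 0]) + noise * rng.standard_normal(n)
    return x, y

def random_relu_features(x, n_features, seed=1):
    # fixed random first layer; only the output weights are fitted
    w_rng = np.random.default_rng(seed)
    W = w_rng.standard_normal((x.shape[1], n_features))
    b = w_rng.standard_normal(n_features)
    return np.maximum(0.0, x @ W + b)

x_train, y_train = make_data(20)
x_test, y_test = make_data(1000)

for n_features in [5, 15, 20, 40, 200]:
    Phi_train = random_relu_features(x_train, n_features)
    Phi_test = random_relu_features(x_test, n_features)
    # lstsq returns the minimum-norm solution in the over-parameterized case
    beta, *_ = np.linalg.lstsq(Phi_train, y_train, rcond=None)
    test_mse = np.mean((Phi_test @ beta - y_test) ** 2)
    print(f"features={n_features:4d}  test MSE={test_mse:.3f}")
```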
8.4.1 Explanation
The discovery of double descent is recent, unexpected, and somewhat puzzling. It results
from an interaction of two phenomena. First, the test performance becomes temporarily
worse when the model has just enough capacity to memorize the data. Second, the test
performance continues to improve with capacity even after the training performance is
perfect. The first phenomenon is exactly as predicted by the bias-variance trade-off. The
second phenomenon is more confusing; it’s unclear why performance should be better in
the over-parameterized regime, given that there are now not even enough training data
points to constrain the model parameters uniquely.
To understand why performance continues to improve as we add more parameters,
note that once the model has enough capacity to drive the training loss to near zero,
the model fits the training data almost perfectly. This implies that further capacity
cannot help the model fit the training data any better; any change must occur between
the training points. The tendency of a model to prioritize one solution over another as
it extrapolates between data points is known as its inductive bias.
Problems 8.4–8.5
The model’s behavior between data points is critical because, in high-dimensional
space, the training data are extremely sparse. The MNIST-1D dataset has 40 dimensions,
and we trained with 10,000 examples. If this seems like plenty of data, consider what
would happen if we quantized each input dimension into 10 bins. There would be 10^40
bins in total, constrained by only 10^4 examples. Even with this coarse quantization,
there will only be one data point in every 10^36 bins! The tendency of the volume of
high-dimensional space to overwhelm the number of training points is termed the curse
of dimensionality.
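The arithmetic of this sparsity argument is easy to verify (a simple check, using the figures quoted above):

```python
# Quantize each of the 40 MNIST-1D input dimensions into 10 bins and compare
# the number of bins with the 10,000 training examples.
n_dims, n_bins_per_dim, n_examples = 40, 10, 10_000
total_bins = n_bins_per_dim ** n_dims          # 10^40 bins
bins_per_example = total_bins // n_examples    # one data point per 10^36 bins
print(f"{total_bins:.0e} bins, one example per {bins_per_example:.0e} bins")
```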
The implication is that problems in high dimensions might look more like figure 8.11a;
there are small regions of the input space where we observe data with significant gaps
between them. The putative explanation for double descent is that as we add capacity
to the model, it interpolates between the nearest data points increasingly smoothly. In
the absence of information about what happens between the training points, assuming
smoothness is sensible and will probably generalize reasonably to new data.
Figure 8.10 Double descent. a) Training and test loss on MNIST-1D for a two-
hidden-layer network as we increase the number of hidden units (and hence pa-
rameters) in each layer. The training loss decreases to zero as the number of
parameters approaches the number of training examples (vertical dashed line).
The test error does not show the expected bias-variance trade-off but continues
to decrease even after the model has memorized the dataset. b) The same exper-
iment is repeated with noisier training data. Again, the training error reduces
to zero, although it now takes almost as many parameters as training points to
memorize the dataset. The test error shows the predicted bias/variance trade-off;
it decreases as the capacity increases but then increases again as we near the point
where the training data is exactly memorized. However, it subsequently decreases
again and ultimately reaches a better performance level. This is known as double
descent. Depending on the loss, the model, and the amount of noise in the data,
the double descent pattern can be seen to a greater or lesser degree across many
datasets. c) Results on MNIST (without label noise) with shallow neural network
from Belkin et al. (2019). d) Results on CIFAR-100 with ResNet18 network (see
chapter 11) from Nakkiran et al. (2021). See original papers for details.
Figure 8.11 Increasing capacity (hidden units) allows smoother interpolation be-
tween sparse data points. a) Consider this situation where the training data
(orange circles) are sparse; there is a large region in the center with no data ex-
amples to constrain the model to mimic the true function (black curve). b) If we
fit a model with just enough capacity to fit the training data (cyan curve), then it
has to contort itself to pass through the training data, and the output predictions
will not be smooth. c–f) However, as we add more hidden units, the model has
the ability to interpolate between the points more smoothly (smoothest possible
curve plotted in each case). However, unlike in this figure, it is not obliged to.
This argument is plausible. It’s certainly true that as we add more capacity to the
model, it will have the capability to create smoother functions. Figures 8.11b–f show the
smoothest possible functions that still pass through the data points as we increase the
number of hidden units. When the number of parameters is very close to the number
of training data examples (figure 8.11b), the model is forced to contort itself to fit the
training data exactly, resulting in erratic predictions. This explains why the peak in the
double descent curve is so pronounced. As we add more hidden units, the model has the
ability to construct smoother functions that are likely to generalize better to new data.
However, this does not explain why over-parameterized models should produce smooth
functions. Figure 8.12 shows three functions that can be created by the simplified model
with 50 hidden units. In each case, the model fits the data exactly, so the loss is zero. If
the modern regime of double descent is explained by increasing smoothness, then what
exactly is encouraging this smoothness?
Figure 8.12 Regularization. a–c) Each of the three fitted curves passes through
the data points exactly, so the training loss for each is zero. However, we might
expect the smooth curve in panel (a) to generalize much better to new data than
the erratic curves in panels (b) and (c). Any factor that biases a model toward
a subset of the solutions with a similar training loss is known as a regularizer.
It is thought that the initialization and/or fitting of neural networks have an
implicit regularizing effect. Consequently, in the over-parameterized regime, more
reasonable solutions, such as that in panel (a), are encouraged.
The answer to this question is uncertain, but there are two likely possibilities. First,
the network initialization may encourage smoothness, and the model never departs from
the sub-domain of smooth functions during the training process. Second, the training
algorithm may somehow “prefer” to converge to smooth functions. Any factor that
biases a solution toward a subset of equivalent solutions is known as a regularizer, so one
possibility is that the training algorithm acts as an implicit regularizer (see section 9.2).
8.5 Choosing hyperparameters
In the previous section, we discussed how test performance changes with model capac-
ity. Unfortunately, in the classical regime, we don’t have access to either the bias (which
requires knowledge of the true underlying function) or the variance (which requires mul-
tiple independently sampled datasets to estimate). In the modern regime, there is no
way to tell how much capacity should be added before the test error stops improving.
This raises the question of exactly how we should choose model capacity in practice.
For a deep network, the model capacity depends on the numbers of hidden layers
and hidden units per layer as well as other aspects of architecture that we have yet to
introduce. Furthermore, the choice of learning algorithm and any associated parameters
(learning rate, etc.) also affects the test performance. These elements are collectively
termed hyperparameters. The process of finding the best hyperparameters is termed
hyperparameter search or (when focused on network structure) neural architecture search.
Hyperparameters are typically chosen empirically; we train many models with different
hyperparameters on the same training set, measure their performance, and retain the
best model. However, we do not measure their performance on the test set; this would
admit the possibility that these hyperparameters just happen to work well for the test
set but don’t generalize to further data. Instead, we introduce a third dataset known
as a validation set. For every choice of hyperparameters, we train the associated model
using the training set and evaluate performance on the validation set. Finally, we select
the model that worked best on the validation set and measure its performance on the
test set. In principle, this should give a reasonable estimate of the true performance.
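The protocol above can be sketched in a few lines. This is an illustration only: it uses polynomial degree as a stand-in hyperparameter rather than network width, and synthetic data of my own construction.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    x = rng.uniform(-1, 1, n)
    return x, np.sin(3 * x) + 0.2 * rng.standard_normal(n)

x_train, y_train = sample(50)
x_val, y_val = sample(50)
x_test, y_test = sample(200)

def val_error(degree):
    coeffs = np.polyfit(x_train, y_train, degree)   # fit on training set only
    return np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)

degrees = range(1, 10)
best_degree = min(degrees, key=val_error)           # select on validation set
coeffs = np.polyfit(x_train, y_train, best_degree)
test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)  # report once
print(best_degree, test_mse)
```

The test set is touched exactly once, after the hyperparameter has been chosen, which is what keeps the final estimate honest.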
The hyperparameter space is generally smaller than the parameter space but still
too large to try every combination exhaustively. Unfortunately, many hyperparameters
are discrete (e.g., the number of hidden layers), and others may be conditional on one
another (e.g., we only need to specify the number of hidden units in the tenth hidden
layer if there are ten or more layers). Hence, we cannot rely on gradient descent methods
as we did for learning the model parameters. Hyperparameter optimization algorithms
intelligently sample the space of hyperparameters, contingent on previous results. This
procedure is computationally expensive since we must train an entire model and measure
the validation performance for each combination of hyperparameters.
8.6 Summary
To measure performance, we use a separate test set. The degree to which performance is
maintained on this test set is known as generalization. Test errors can be explained by
three factors: noise, bias, and variance. These combine additively in regression problems
with least squares losses. Adding training data decreases the variance. When the model
capacity is less than the number of training examples, increasing the capacity decreases
bias but increases variance. This is known as the bias-variance trade-off, and there is a
capacity where the trade-off is optimal.
However, this is balanced against a tendency for performance to improve with ca-
pacity, even when the parameters exceed the training examples. Together, these two
phenomena create the double descent curve. It is thought that the model interpolates
more smoothly between the training data points in the over-parameterized “modern
regime,” although it is unclear what drives this. To choose the capacity and other model
and training algorithm hyperparameters, we fit multiple models and evaluate their per-
formance using a separate validation set.
Notes
Bias-variance trade-o: We showed that the test error for regression problems with least
squares loss decomposes into the sum of noise, bias, and variance terms. These factors are
all present for models with other losses, but their interaction is typically more complicated
(Friedman, 1997; Domingos, 2000). For classication problems, there are some counter-intuitive
predictions; for example, if the model is biased toward selecting the wrong class in a region of
the input space, then increasing the variance can improve the classification rate as this pushes
some of the predictions over the threshold to be classified correctly.
Cross-validation: We saw that it is typical to divide the data into three parts: training
data (which is used to learn the model parameters), validation data (which is used to choose
the hyperparameters), and test data (which is used to estimate the final performance). This
approach is known as cross-validation. However, this division may cause problems where the
total number of data examples is limited; if the number of training examples is comparable to
the model capacity, then the variance will be large.
One way to mitigate this problem is to use k-fold cross-validation. The training and validation
data are partitioned into K disjoint subsets. For example, we might divide these data into
five parts. We train with four and validate with the fifth for each of the five permutations
and choose the hyperparameters based on the average validation performance. The final test
performance is assessed using the average of the predictions from the five models with the best
hyperparameters on an entirely different test set. There are many variations of this idea, but
all share the general goal of using a larger proportion of the data to train the model, thereby
reducing variance.
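The selection step of this recipe can be sketched as follows (a minimal illustration with K = 5, again using polynomial degree as a stand-in hyperparameter on synthetic data):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100)
y = np.sin(3 * x) + 0.2 * rng.standard_normal(100)

# partition indices once into K = 5 disjoint folds
folds = np.array_split(rng.permutation(len(x)), 5)

def kfold_error(degree):
    errors = []
    for k in range(5):
        val_idx = folds[k]                          # validate on fold k
        train_idx = np.concatenate([folds[j] for j in range(5) if j != k])
        coeffs = np.polyfit(x[train_idx], y[train_idx], degree)
        errors.append(np.mean((np.polyval(coeffs, x[val_idx]) - y[val_idx]) ** 2))
    return np.mean(errors)                          # average over the 5 splits

best_degree = min(range(1, 10), key=kfold_error)
print("degree chosen by 5-fold CV:", best_degree)
```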
Capacity: We have used the term capacity informally to mean the number of parameters or
hidden units in the model (and hence indirectly, the ability of the model to fit functions of
increasing complexity). The representational capacity of a model describes the space of possible
functions it can construct when we consider all possible parameter values. When we take into
account the fact that an optimization algorithm may not be able to reach all of these solutions,
what is left is the effective capacity.
The Vapnik-Chervonenkis (VC) dimension (Vapnik & Chervonenkis, 1971) is a more formal
measure of capacity. It is the largest number of training examples that a binary classifier can
label arbitrarily. Bartlett et al. (2019) derive upper and lower bounds for the VC dimension in
terms of the number of layers and weights. An alternative measure of capacity is the Rademacher
complexity, which is the expected empirical performance of a classification model (with optimal
parameters) for data with random labels. Neyshabur et al. (2017) derive a lower bound on the
generalization error in terms of the Rademacher complexity.
Double descent: The term “double descent” was coined by Belkin et al. (2019), who demon-
strated that the test error decreases again in the over-parameterized regime for two-layer neural
networks and random features. They also claimed that this occurs in decision trees, although
Buschjäger & Morik (2021) subsequently provided evidence to the contrary. Nakkiran et al.
(2021) show that double descent occurs for various modern datasets (CIFAR-10, CIFAR-100,
IWSLT’14 de-en), architectures (CNNs, ResNets, transformers), and optimizers (SGD, Adam).
The phenomenon is more pronounced when noise is added to the target labels (Nakkiran et al.,
2021) and when some regularization techniques are used (Ishida et al., 2020).
Nakkiran et al. (2021) also provide empirical evidence that test performance depends on effective
model capacity (the largest number of samples for which a given model and training method can
achieve zero training error). At this point, the model starts to devote its efforts to interpolating
smoothly. As such, the test performance depends not just on the model but also on the training
algorithm and length of training. They observe the same pattern when they study a model with
fixed capacity and increase the number of training iterations. They term this epoch-wise double
descent. This phenomenon has been modeled by Pezeshki et al. (2022) in terms of different
features in the model being learned at different speeds.
Double descent makes the rather strange prediction that adding training data can sometimes
worsen test performance. Consider an over-parameterized model in the second descending part
of the curve. If we increase the training data to match the model capacity, we will now be in
the critical region of the new test error curve, and the test loss may increase.
Bubeck & Sellke (2021) prove that overparameterization is necessary to interpolate data smoothly
in high dimensions. They demonstrate a trade-off between the number of parameters and the
Lipschitz constant of a model (the fastest the output can change for a small input change). A
review of the theory of over-parameterized machine learning can be found in Dar et al. (2021).
Appendix B.1.1 Lipschitz constant
Curse of dimensionality: As dimensionality increases, the volume of space grows so fast that
the amount of data needed to densely sample it increases exponentially. This phenomenon is
known as the curse of dimensionality. High-dimensional space has many unexpected properties,
and caution should be used when trying to reason about it based on low-dimensional exam-
ples. This book visualizes many aspects of deep learning in one or two dimensions, but these
visualizations should be treated with healthy skepticism.
Surprising properties of high-dimensional spaces include: (i) Two randomly sampled data points
from a standard normal distribution are very close to orthogonal to one another (relative to
the origin) with high likelihood. (ii) The distance from the origin of samples from a standard
normal distribution is roughly constant. (iii) Most of the volume of a high-dimensional sphere
(hypersphere) is adjacent to its surface (a common metaphor is that most of the volume of a high-
dimensional orange is in the peel, not in the pulp). (iv) If we place a unit-diameter hypersphere
inside a hypercube with unit-length sides, then the hypersphere takes up a decreasing proportion
of the volume of the cube as the dimension increases. Since the volume of the cube is fixed at
size one, this implies that the volume of a high-dimensional hypersphere becomes close to zero.
(v) For random points drawn from a uniform distribution in a high-dimensional hypercube, the
ratio of the Euclidean distance between the nearest and furthest points becomes close to one.
For further information, consult Beyer et al. (1999) and Aggarwal et al. (2001).
Problems 8.6–8.9
Notebook 8.4 High-dimensional spaces
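Properties (i) and (ii) are easy to check by Monte Carlo simulation (a quick sketch; the choice of D = 1000 and the sample counts are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 1000

# (i) two random Gaussian vectors are nearly orthogonal relative to the origin
a, b = rng.standard_normal(D), rng.standard_normal(D)
cos_angle = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"cosine of angle: {cos_angle:.3f}")        # close to 0

# (ii) lengths of Gaussian samples concentrate near sqrt(D)
lengths = np.linalg.norm(rng.standard_normal((10_000, D)), axis=1)
print(f"mean length {lengths.mean():.1f} vs sqrt(D) {np.sqrt(D):.1f}")
print(f"std of lengths: {lengths.std():.2f}")     # small relative to the mean
```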
Real-world performance: In this chapter, we argued that model performance could be evalu-
ated using a held-out test set. However, the result won’t be indicative of real-world performance
if the statistics of the test set don’t match those of real-world data. Moreover, the statistics
of real-world data may change over time, causing the model to become increasingly stale and
performance to decrease. This is known as data drift and means that deployed models must be
carefully monitored.
There are three main reasons why real-world performance may be worse than the test perfor-
mance implies. First, the statistics of the input data x may change; we may now be observing
parts of the function that were sparsely sampled or not sampled at all during training. This
is known as covariate shift. Second, the statistics of the output data y may change; if some
output values are infrequent during training, then the model may learn not to predict these in
ambiguous situations and will make mistakes if they are more common in the real world. This
is known as prior shift. Third, the relationship between input and output may change. This is
known as concept shift. These issues are discussed in Moreno-Torres et al. (2012).
Hyperparameter search: Finding the best hyperparameters is a challenging optimization
task. Testing a single configuration of hyperparameters is expensive; we must train an entire
model and measure its performance. We have no easy way to access the derivatives (i.e., how
performance changes when we make a small change to a hyperparameter). Moreover, many of
the hyperparameters are discrete, so we cannot use gradient descent methods. There are multiple
local minima and no way to tell if we are close to the global minimum. The noise level is high
since each training/validation cycle uses a stochastic training algorithm; we expect different
results if we train a model twice with the same hyperparameters. Finally, some variables are
conditional and only exist if others are set. For example, the number of hidden units in the
third hidden layer is only relevant if we have at least three hidden layers.
A simple approach is to sample the space randomly (Bergstra & Bengio, 2012). However,
for continuous variables, it is better to build a model of performance as a function of the
hyperparameters and the uncertainty in this function. This can be exploited to test where the
uncertainty is great (explore the space) or home in on regions where performance looks promising
(exploit previous knowledge). Bayesian optimization is a framework based on Gaussian processes
that does just this, and its application to hyperparameter search is described in Snoek et al.
(2012). The Beta-Bernoulli bandit (see Lattimore & Szepesvári, 2020) is a roughly equivalent
model for describing uncertainty in results due to discrete variables.
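The random-search baseline is simple enough to sketch directly. Here the "validation loss" is a hypothetical analytic surrogate of my own invention (a real search would train and evaluate a model at each sample):

```python
import numpy as np

rng = np.random.default_rng(0)

def validation_loss(log_lr, width):
    # hypothetical smooth surrogate with a minimum near log_lr = -3, width = 100
    return (log_lr + 3) ** 2 + 0.0001 * (width - 100) ** 2

best = None
for _ in range(50):
    log_lr = rng.uniform(-6, 0)                   # continuous hyperparameter
    width = int(rng.integers(10, 500))            # discrete hyperparameter
    loss = validation_loss(log_lr, width)
    if best is None or loss < best[0]:
        best = (loss, log_lr, width)

print(f"best loss {best[0]:.3f} at log_lr={best[1]:.2f}, width={best[2]}")
```

Bayesian optimization improves on this by choosing each new sample based on a model of the results so far rather than uniformly at random.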
The sequential model-based configuration (SMAC) algorithm (Hutter et al., 2011) can cope with
continuous, discrete, and conditional parameters. The basic approach is to use a random forest
to model the objective function, where the mean of the tree predictions is the best guess about
the objective function, and their variance represents the uncertainty. A completely different
approach that can also cope with combinations of continuous, discrete, and conditional param-
eters is Tree-Parzen estimators (Bergstra et al., 2011). The previous methods modeled the
probability of the model performance given the hyperparameters. In contrast, the Tree-Parzen
estimator models the probability of the hyperparameters given the model performance.
Hyperband (Li et al., 2017b) is a multi-armed bandit strategy for hyperparameter optimization.
It assumes that there are computationally cheap but approximate ways to measure performance
(e.g., by not training to completion) and that these can be associated with a budget (e.g., by
training for a fixed number of iterations). A number of random configurations are sampled and
run until the budget is used up. Then the best fraction 1/η of runs is kept, and the budget is
multiplied by η. This is repeated until the maximum budget is reached. This approach has
the advantage of efficiency; for bad configurations, it does not need to run the experiment to the
end. However, each sample is just chosen randomly, which is inefficient. The BOHB algorithm
(Falkner et al., 2018) combines the efficiency of Hyperband with the more sensible choice of
hyperparameters from Tree-Parzen estimators to construct an even better method.
Problems
Problem 8.1 Will the multiclass cross-entropy training loss in figure 8.2 ever reach zero? Explain
your reasoning.
Problem 8.2 What values should we choose for the three weights and biases in the first layer of
the model in figure 8.4a so that the hidden unit’s responses are as depicted in figures 8.4b–d?
Problem 8.3 Given a training dataset consisting of I input/output pairs {x_i, y_i}, show how
the parameters {β, ω_1, ω_2, ω_3} for the model in figure 8.4a can be found in closed form
using the least squares loss function.
Problem 8.4 Consider the curve in figure 8.10b at the point where we train a model with a
hidden layer of size 200, which would have 50,410 parameters. What do you predict will happen
to the training and test performance if we increase the number of training examples from 10,000
to 50,410?
Problem 8.5 Consider the case where the model capacity exceeds the number of training data
points, and the model is flexible enough to reduce the training loss to zero. What are the
implications of this for fitting a heteroscedastic model? Propose a method to resolve any
problems that you identify.
Problem 8.6 Show that two random points drawn from a 1000-dimensional standard Gaussian
distribution are orthogonal relative to the origin with high probability.
Figure 8.13 Typical sets. a) Standard normal distribution in two dimensions.
Circles are four samples from this distribution. As the distance from the cen-
ter increases, the probability decreases, but the volume of space at that radius
(i.e., the area between adjacent evenly spaced circles) increases. b) These fac-
tors trade off so that the histogram of distances of samples from the center has
a pronounced peak. c) In higher dimensions, this effect becomes more extreme,
and the probability of observing a sample close to the mean becomes vanishingly
small. Although the most likely point is at the mean of the distribution, the
typical samples are found in a relatively narrow shell.
Problem 8.7 The volume of a hypersphere with radius r in D dimensions is:

\mathrm{Vol}[r] = \frac{r^D \pi^{D/2}}{\Gamma[D/2 + 1]},    (8.8)

where Γ[•] is the Gamma function. Show using Stirling’s formula that the volume of a hyper-
sphere of diameter one (radius r = 0.5) becomes zero as the dimension increases.
Appendix B.1.3 Gamma function
Appendix B.1.4 Stirling’s formula
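A numerical check of this limit (not the requested Stirling's-formula proof) is straightforward; working in log space via `lgamma` avoids overflowing Γ[D/2 + 1]:

```python
import math

def sphere_volume(r, D):
    # equation 8.8 evaluated in log space to avoid overflow for large D
    return math.exp(D * math.log(r) + (D / 2) * math.log(math.pi)
                    - math.lgamma(D / 2 + 1))

for D in [2, 10, 50, 100]:
    print(D, sphere_volume(0.5, D))   # shrinks rapidly toward zero
```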
Problem 8.8 Consider a hypersphere of radius r = 1. Find an expression for the proportion
of the total volume that lies in the outermost 1% of the distance from the center (i.e., in the
outermost shell of thickness 0.01). Show that this becomes one as the dimension increases.
Problem 8.9 Figure 8.13c shows the distribution of distances of samples of a standard normal
distribution as the dimension increases. Empirically verify this finding by sampling from the
standard normal distributions in 25, 100, and 500 dimensions and plotting a histogram of the
distances from the center. What closed-form probability distribution describes these distances?
Chapter 9
Regularization
Chapter 8 described how to measure model performance and identified that there could
be a significant performance gap between the training and test data. Possible reasons for
this discrepancy include: (i) the model describes statistical peculiarities of the training
data that are not representative of the true mapping from input to output (overfitting),
and (ii) the model is unconstrained in areas with no training examples, leading to sub-
optimal predictions.
This chapter discusses regularization techniques. These are a family of methods that
reduce the generalization gap between training and test performance. Strictly speaking,
regularization involves adding explicit terms to the loss function that favor certain pa-
rameter choices. However, in machine learning, this term is commonly used to refer to
any strategy that improves generalization.
We start by considering regularization in its strictest sense. Then we show how
the stochastic gradient descent algorithm itself favors certain solutions. This is known
as implicit regularization. Following this, we consider a set of heuristic methods that
improve test performance. These include early stopping, ensembling, dropout, label
smoothing, and transfer learning.
9.1 Explicit regularization
Consider tting a model f[x, ϕ] with parameters ϕ using a training set {x
i
, y
i
} of in-
put/output pairs. We seek the minimum of the loss function L[ϕ] :
ˆ
ϕ = argmin
ϕ
L[ϕ]
= argmin
ϕ
"
I
X
i=1
i
[x
i
, y
i
]
#
, (9.1)
where the individual terms ℓ_i[x_i, y_i] measure the mismatch between the network pre-
dictions f[x_i, ϕ] and output targets y_i for each training pair. To bias this minimization
Figure 9.1 Explicit regularization. a) Loss function for Gabor model (see sec-
tion 6.1.2). Cyan circles represent local minima. Gray circle represents the global
minimum. b) The regularization term favors parameters close to the center of the
plot by adding an increasing penalty as we move away from this point. c) The
final loss function is the sum of the original loss function plus the regularization
term. This surface has fewer local minima, and the global minimum has moved
to a different position (arrow shows change).
toward certain solutions, we include an additional term:

\hat{\phi} = \operatorname*{argmin}_{\phi} \left[ \sum_{i=1}^{I} \ell_i[x_i, y_i] + \lambda \cdot g[\phi] \right],    (9.2)
where g[ϕ] is a function that returns a scalar that takes a larger value when the pa-
rameters are less preferred. The term λ is a positive scalar that controls the relative
contribution of the original loss function and the regularization term. The minima of
the regularized loss function usually differ from those in the original, so the training
procedure converges to different parameter values (figure 9.1).
9.1.1 Probabilistic interpretation
Regularization can be viewed from a probabilistic perspective. Section 5.1 shows how
loss functions are constructed from the maximum likelihood criterion:

\hat{\phi} = \operatorname*{argmax}_{\phi} \left[ \prod_{i=1}^{I} Pr(y_i \,|\, x_i, \phi) \right].    (9.3)
The regularization term can be considered as a prior Pr(ϕ) that represents knowledge
about the parameters before we observe the data, and we now have the maximum a
posteriori or MAP criterion:
\hat{\phi} = \operatorname*{argmax}_{\phi} \left[ \prod_{i=1}^{I} Pr(y_i \,|\, x_i, \phi) \, Pr(\phi) \right].    (9.4)
Moving back to the negative log-likelihood loss function by taking the log and multiplying
by minus one, we see that λ · g[ϕ] = −log[Pr(ϕ)].
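As a concrete example of this correspondence (a standard derivation; the prior variance σ_ϕ² is an assumed quantity, not specified in the text), a zero-mean Gaussian prior over the parameters yields an L2 penalty:

```latex
\Pr(\phi) = \prod_j \text{Norm}_{\phi_j}\!\left[0, \sigma_\phi^2\right]
\quad\Longrightarrow\quad
-\log\bigl[\Pr(\phi)\bigr] = \frac{1}{2\sigma_\phi^2}\sum_j \phi_j^2 + \text{const.},
```

so this choice corresponds to λ · g[ϕ] = λ Σ_j ϕ_j² with λ = 1/(2σ_ϕ²); a narrower prior (smaller σ_ϕ²) gives stronger regularization.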
9.1.2 L2 regularization
This discussion has sidestepped the question of which solutions the regularization term
should penalize (or equivalently that the prior should favor). Since neural networks are
used in an extremely broad range of applications, these can only be very generic pref-
erences. The most commonly used regularization term is the L2 norm, which penalizes
the sum of the squares of the parameter values:
\hat{\phi} = \operatorname*{argmin}_{\phi} \left[ \sum_{i=1}^{I} \ell_i[x_i, y_i] + \lambda \sum_{j} \phi_j^2 \right],    (9.5)
where j indexes the parameters. This is also referred to as Tikhonov regularization or
ridge regression, or (when applied to matrices) Frobenius norm regularization.
Problems 9.1–9.2
For neural networks, L2 regularization is usually applied to the weights but not
the biases and is hence referred to as a weight decay term. The effect is to encourage
smaller weights, so the output function is smoother. To see this, consider that the
output prediction is a weighted sum of the activations at the last hidden layer. If the
weights have a smaller magnitude, the output will vary less. The same logic applies to
the computation of the pre-activations at the last hidden layer and so on, progressing
backward through the network. In the limit, if we forced all the weights to be zero, the
network would produce a constant output determined by the final bias parameter.
Notebook 9.1 L2 regularization
Figure 9.2 shows the eect of tting the simplied network from gure 8.4 with weight
decay and dierent values of the regularization coecient λ. When λ is small, it has
little eect. However, as λ increases, the t to the data becomes less accurate, and the
function becomes smoother. This might improve the test performance for two reasons:
If the network is overtting, then adding the regularization term means that the
network must trade o slavish adherence to the data against the desire to be
smooth. One way to think about this is that the error due to variance reduces (the
model no longer needs to pass through every data point) at the cost of increased
bias (the model can only describe smooth functions).
When the network is over-parameterized, some of the extra model capacity de-
scribes areas with no training data. Here, the regularization term will favor func-
tions that smoothly interpolate between the nearby points. This is reasonable
behavior in the absence of knowledge about the true function.
This work is subject to a Creative Commons CC-BY-NC-ND license. (C) MIT Press.
Figure 9.2 L2 regularization in simplified network (see figure 8.4). a–f) Fitted functions as we increase the regularization coefficient λ. The black curve is the true function, the orange circles are the noisy training data, and the cyan curve is the fitted model. For small λ (panels a–b), the fitted function passes exactly through the data points. For intermediate λ (panels c–d), the function is smoother and more similar to the ground truth. For large λ (panels e–f), the fitted function is smoother than the ground truth, so the fit is worse.
9.2 Implicit regularization
An intriguing recent finding is that neither gradient descent nor stochastic gradient descent moves neutrally to the minimum of the loss function; each exhibits a preference for some solutions over others. This is known as implicit regularization.
9.2.1 Implicit regularization in gradient descent
Consider a continuous version of gradient descent where the step size is infinitesimal. The change in parameters ϕ will be governed by the differential equation:

$$\frac{d\phi}{dt} = -\frac{\partial L}{\partial \phi}. \tag{9.6}$$

Gradient descent approximates this process with a series of discrete steps of size α:
$$\phi_{t+1} = \phi_t - \alpha\frac{\partial L[\phi_t]}{\partial \phi}. \tag{9.7}$$

Figure 9.3 Implicit regularization in gradient descent. a) Loss function with family of global minima on horizontal line ϕ_1 = 0.61. Dashed blue line shows continuous gradient descent path starting in bottom-left. Cyan trajectory shows discrete gradient descent with step size 0.1 (first few steps shown explicitly as arrows). The finite step size causes the paths to diverge and reach a different final position. b) This disparity can be approximated by adding a regularization term to the continuous gradient descent loss function that penalizes the squared gradient magnitude. c) After adding this term, the continuous gradient descent path converges to the same place that the discrete one did on the original function.

Draft: please send errata to udlbookmail@gmail.com.
The discretization causes a deviation from the continuous path (figure 9.3). This deviation can be understood by deriving a modified loss term $\tilde{L}$ for the continuous case that arrives at the same place as the discretized version on the original loss L. It can be shown (see end of chapter) that this modified loss is:

$$\tilde{L}_{GD}[\phi] = L[\phi] + \frac{\alpha}{4}\left\|\frac{\partial L}{\partial \phi}\right\|^2. \tag{9.8}$$

In other words, the discrete trajectory is repelled from places where the gradient norm is large (the surface is steep). This doesn't change the position of the minima where the gradients are zero anyway. However, it changes the effective loss function elsewhere and modifies the optimization trajectory, which potentially converges to a different minimum.
Implicit regularization due to gradient descent may be responsible for the observation that full batch gradient descent generalizes better with larger step sizes (figure 9.5a).
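As a numerical sanity check on equation 9.8, the sketch below (hypothetical; a toy one-dimensional loss, with the gradient approximated by finite differences) confirms that the penalty vanishes at minima, where the gradient is zero, but raises the effective loss on steep slopes:

```python
import numpy as np

def loss(phi):
    """Toy 1-D loss with minima at phi = +/-1 where the gradient vanishes."""
    return (phi ** 2 - 1.0) ** 2

def grad(phi, eps=1e-5):
    """Central finite-difference approximation to dL/dphi."""
    return (loss(phi + eps) - loss(phi - eps)) / (2 * eps)

def modified_loss(phi, alpha):
    """Effective loss of equation 9.8: the original loss plus a penalty on
    the squared gradient magnitude, scaled by step size alpha over 4."""
    return loss(phi) + (alpha / 4.0) * grad(phi) ** 2

# At a minimum (phi = 1) the gradient is zero, so the extra term vanishes
# and the position of the minimum is unchanged.
assert abs(modified_loss(1.0, 0.1) - loss(1.0)) < 1e-6
# On a steep slope the modified loss exceeds the original one.
assert modified_loss(0.5, 0.1) > loss(0.5)
```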
Figure 9.4 Implicit regularization for stochastic gradient descent. a) Original loss function for Gabor model (section 6.1.2). Blue point represents global minimum. b) Implicit regularization term from gradient descent penalizes the squared gradient magnitude. c) Additional implicit regularization from stochastic gradient descent penalizes the variance of the batch gradients. d) Modified loss function (sum of original loss plus two implicit regularization components). Blue point represents global minimum, which may now be in a different place from panel (a).
9.2.2 Implicit regularization in stochastic gradient descent
A similar analysis can be applied to stochastic gradient descent. Now we seek a modified loss function such that the continuous version reaches the same place as the average of the possible random SGD updates. This can be shown to be:
$$\tilde{L}_{SGD}[\phi] = \tilde{L}_{GD}[\phi] + \frac{\alpha}{4B}\sum_{b=1}^{B}\left\|\frac{\partial L_b}{\partial \phi} - \frac{\partial L}{\partial \phi}\right\|^2 = L[\phi] + \frac{\alpha}{4}\left\|\frac{\partial L}{\partial \phi}\right\|^2 + \frac{\alpha}{4B}\sum_{b=1}^{B}\left\|\frac{\partial L_b}{\partial \phi} - \frac{\partial L}{\partial \phi}\right\|^2. \tag{9.9}$$
Here, L_b is the loss for the b-th of the B batches in an epoch, and both L and L_b now represent the means of the I individual losses in the full dataset and the |B_b| individual losses in the batch, respectively:

$$L = \frac{1}{I}\sum_{i=1}^{I}\ell_i[x_i, y_i] \quad\text{and}\quad L_b = \frac{1}{|\mathcal{B}_b|}\sum_{i\in\mathcal{B}_b}\ell_i[x_i, y_i]. \tag{9.10}$$
Equation 9.9 reveals an extra regularization term, which corresponds to the variance of the gradients of the batch losses L_b. In other words, SGD implicitly favors places where the gradients are stable (where all the batches agree on the slope). Once more, this modifies the trajectory of the optimization process (figure 9.4) but does not necessarily change the position of the global minimum; if the model is over-parameterized, then it may fit all the training data exactly, so all of these gradient terms will be zero at the global minimum.
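The extra term in equation 9.9 can be computed directly from a set of batch gradients. Below is a hypothetical numpy sketch (the gradients are random stand-ins, not produced by a real network): when all batches agree on the slope, the penalty is zero; when they disagree, it grows.

```python
import numpy as np

rng = np.random.default_rng(1)

def sgd_implicit_penalty(batch_grads, alpha):
    """Extra SGD regularization term from equation 9.9: the mean squared
    deviation of the batch gradients from the full-batch gradient,
    scaled by alpha / 4."""
    full_grad = batch_grads.mean(axis=0)   # gradient of L (mean over batches)
    deviations = batch_grads - full_grad   # dL_b/dphi - dL/dphi for each b
    return (alpha / 4.0) * np.mean(np.sum(deviations ** 2, axis=1))

# Gradients of B = 8 hypothetical batches for a model with 5 parameters.
agree = np.tile(rng.normal(size=5), (8, 1))   # all batches agree exactly
disagree = agree + rng.normal(size=(8, 5))    # batches disagree

assert np.isclose(sgd_implicit_penalty(agree, 0.1), 0.0)
assert sgd_implicit_penalty(disagree, 0.1) > 0.0
```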
SGD generalizes better than gradient descent, and smaller batch sizes generally perform better than larger ones (figure 9.5b). One possible explanation is that the inherent randomness allows the algorithm to reach different parts of the loss function. However, it's also possible that some or all of this performance increase is due to implicit regularization; this encourages solutions where all the data fit well (so the batch variance is small) rather than solutions where some of the data fit extremely well and other data less well (perhaps with the same overall loss, but with larger batch variance). The former solutions are likely to generalize better. [Notebook 9.2: Implicit regularization]
9.3 Heuristics to improve performance
We've seen that explicit regularization encourages the training algorithm to find a good solution by adding extra terms to the loss function. This also occurs implicitly as an unintended (but seemingly helpful) byproduct of stochastic gradient descent. This section describes other heuristic methods used to improve generalization.
Figure 9.5 Effect of learning rate (LR) and batch size for 4000 training and 4000 test examples from MNIST-1D (see figure 8.1) for a neural network with two hidden layers. a) Performance is better for large learning rates than for intermediate or small ones. In each case, the number of iterations is 6000/LR, so each solution has the opportunity to move the same distance. b) Performance is superior for smaller batch sizes. In each case, the number of iterations was chosen so that the training data were memorized at roughly the same model capacity.
9.3.1 Early stopping
Early stopping refers to stopping the training procedure before it has fully converged. This can reduce overfitting if the model has already captured the coarse shape of the underlying function but has not yet had time to overfit to the noise (figure 9.6). One way of thinking about this is that since the weights are initialized to small values (see section 7.5), they simply don't have time to become large, so early stopping has a similar effect to explicit L2 regularization. A different view is that early stopping reduces the effective model complexity. Hence, we move back down the bias/variance trade-off curve from the critical region, and performance improves (see figures 8.9 and 8.10).

Early stopping has a single hyperparameter, the number of steps after which learning is terminated. As usual, this is chosen empirically using a validation set (section 8.5). However, for early stopping, the hyperparameter can be selected without the need to train multiple models. The model is trained once, the performance on the validation set is monitored every T iterations, and the associated models are stored. The stored model where the validation performance was best is selected.
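This procedure can be sketched as a generic loop; the `train_step` and `validate` callables below are placeholders standing in for a real training iteration and validation pass, not an API from the book:

```python
import numpy as np

def train_with_early_stopping(train_step, validate, n_iters, check_every):
    """Train once, check validation performance every `check_every`
    iterations, store the model each time, and return the stored model
    with the best validation loss."""
    best_params, best_val = None, np.inf
    for it in range(1, n_iters + 1):
        params = train_step()
        if it % check_every == 0:
            val = validate(params)
            if val < best_val:                 # keep the best snapshot so far
                best_val, best_params = val, params.copy()
    return best_params, best_val

# Hypothetical training dynamics: the parameter just counts iterations, and
# validation loss is best at iteration 20, then worsens (overfitting).
counter = {"t": 0}

def fake_step():
    counter["t"] += 1
    return np.array([float(counter["t"])])

def fake_validate(params):
    return (params[0] - 20.0) ** 2

params, val = train_with_early_stopping(fake_step, fake_validate,
                                        n_iters=40, check_every=1)
assert params[0] == 20.0 and val == 0.0   # snapshot from iteration 20 is kept
```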
Figure 9.6 Early stopping. a) Simplified shallow network model with 14 linear regions (figure 8.4) is initialized randomly (cyan curve) and trained with SGD using a batch size of five and a learning rate of 0.05. b–d) As training proceeds, the function first captures the coarse structure of the true function (black curve) before e–f) overfitting to the noisy training data (orange points). Although the training loss continues to decrease throughout this process, the learned models in panels (c) and (d) are closest to the true underlying function. They will generalize better on average to test data than those in panels (e) or (f).

9.3.2 Ensembling

Another approach to reducing the generalization gap between training and test data is to build several models and average their predictions. A group of such models is known as an ensemble. This technique reliably improves test performance at the cost of training and storing multiple models and performing inference multiple times.

The models can be combined by taking the mean of the outputs (for regression problems) or the mean of the pre-softmax activations (for classification problems). The assumption is that model errors are independent and will cancel out. Alternatively, we can take the median of the outputs (for regression problems) or the most frequent predicted class (for classification problems) to make the predictions more robust.
One way to train different models is just to use different random initializations. This may help in regions of input space far from the training data. Here, the fitted function is relatively unconstrained, and different models may produce different predictions, so the average of several models may generalize better than any single model. [Notebook 9.3: Ensembling]

A second approach is to generate several different datasets by re-sampling the training data with replacement and training a different model from each. This is known as bootstrap aggregating or bagging for short (figure 9.7). It has the effect of smoothing out the data; if a data point is not present in one training set, the model will interpolate from nearby points; hence, if that point was an outlier, the fitted function will be more moderate in this region. Other approaches include training models with different hyperparameters or training completely different families of models.

Figure 9.7 Ensemble methods. a) Fitting a single model (gray curve) to the entire dataset (orange points). b–e) Four models created by re-sampling the data with replacement (bagging) four times (size of orange point indicates number of times the data point was re-sampled). f) When we average the predictions of this ensemble, the result (cyan curve) is smoother than the result from panel (a) for the full dataset (gray curve) and will probably generalize better.
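Bagging can be sketched in a few lines; the straight-line least-squares "model" below is a stand-in for a network, chosen only to keep the example self-contained:

```python
import numpy as np

rng = np.random.default_rng(2)

def bagged_fit_predict(x, y, x_test, fit, predict, n_models=4):
    """Bootstrap aggregating: re-sample the training set with replacement,
    fit one model per re-sampled dataset, and average the predictions
    (the mean combination rule for regression)."""
    preds = []
    for _ in range(n_models):
        idx = rng.integers(0, len(x), size=len(x))  # sample with replacement
        model = fit(x[idx], y[idx])
        preds.append(predict(model, x_test))
    return np.mean(preds, axis=0)

# Toy stand-in for a network: a degree-1 polynomial least-squares fit.
x = np.linspace(0.0, 1.0, 30)
y = 2.0 * x + rng.normal(0.0, 0.1, size=30)
fit = lambda xs, ys: np.polyfit(xs, ys, deg=1)
predict = lambda coeffs, xs: np.polyval(coeffs, xs)

ensemble_pred = bagged_fit_predict(x, y, np.array([0.25, 0.75]), fit, predict)
assert ensemble_pred.shape == (2,)
```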
9.3.3 Dropout
Dropout randomly clamps a subset (typically 50%) of hidden units to zero at each iteration of SGD (figure 9.8). This makes the network less dependent on any given hidden unit and encourages the weights to have smaller magnitudes so that the change in the function due to the presence or absence of the hidden unit is reduced.

Figure 9.8 Dropout. a) Original network. b–d) At each training iteration, a random subset of hidden units is clamped to zero (gray nodes). The result is that the incoming and outgoing weights from these units have no effect, so we are training with a slightly different network each time.

This technique has the positive benefit that it can eliminate undesirable "kinks" in the function that are far from the training data and don't affect the loss. For example, consider three hidden units that become active sequentially as we move along the curve (figure 9.9a). The first hidden unit causes a large increase in the slope. A second hidden unit decreases the slope, so the function goes back down. Finally, the third unit cancels out this decrease and returns the curve to its original trajectory. These three units conspire to make an undesirable local change in the function. This will not change the training loss but is unlikely to generalize well.

When several units conspire in this way, eliminating one (as would happen in dropout) causes a considerable change to the output function that is propagated to the half-space where that unit was active (figure 9.9b). A subsequent gradient descent step will attempt to compensate for the change that this induces, and such dependencies will be eliminated over time. The overall effect is that large unnecessary changes between training data points are gradually removed even though they contribute nothing to the loss (figure 9.9).
At test time, we can run the network as usual with all the hidden units active; however, the network now has more hidden units than it was trained with at any given iteration, so we multiply the weights by one minus the dropout probability to compensate. This is known as the weight scaling inference rule. A different approach to inference is to use Monte Carlo dropout, in which we run the network multiple times with different random subsets of units clamped to zero (as in training) and combine the results. This is closely related to ensembling in that every random version of the network is a different model; however, we do not have to train or store multiple networks here.
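A minimal sketch of the training-time mask and the weight scaling inference rule (the layer sizes and 50% drop probability here are illustrative, not prescriptions):

```python
import numpy as np

rng = np.random.default_rng(3)

def dropout_train(h, p_drop=0.5):
    """Training-time dropout: clamp a random subset of hidden units to zero."""
    mask = rng.random(h.shape) >= p_drop   # True = unit is kept
    return h * mask

def dropout_test(h, weights, p_drop=0.5):
    """Weight scaling inference rule: all units stay active, but the
    outgoing weights are multiplied by one minus the dropout probability."""
    return h @ (weights * (1.0 - p_drop))

h = rng.normal(size=(4, 10))   # activations of 10 hidden units, batch of 4
w = rng.normal(size=(10, 2))   # outgoing weights to 2 outputs

dropped = dropout_train(h)
assert np.all((dropped == 0.0) | (dropped == h))  # units zeroed, never rescaled

# Test-time output matches the training-time output averaged over masks.
assert np.allclose(dropout_test(h, w), (h * 0.5) @ w)
```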
Figure 9.9 Dropout mechanism. a) An undesirable kink in the curve is caused by a sequential increase in the slope, decrease in the slope (at circled joint), and then another increase to return the curve to its original trajectory. Here we are using full-batch gradient descent, and the model (from figure 8.4) fits the data as well as possible, so further training won't remove the kink. b) Consider what happens if we remove the eighth hidden unit that produced the circled joint in panel (a), as might happen using dropout. Without the decrease in the slope, the right-hand side of the function takes an upwards trajectory, and a subsequent gradient descent step will aim to compensate for this change. c) Curve after 2000 iterations of (i) randomly removing one of the three hidden units that cause the kink and (ii) performing a gradient descent step. The kink does not affect the loss but is nonetheless removed by this approximation of the dropout mechanism.
9.3.4 Applying noise
Dropout can be interpreted as applying multiplicative Bernoulli noise to the network activations. This leads to the idea of applying noise to other parts of the network during training to make the final model more robust.
One option is to add noise to the input data; this smooths out the learned function (figure 9.10). For regression problems, it can be shown to be equivalent to adding a regularizing term that penalizes the derivatives of the network's output with respect to its input. An extreme variant is adversarial training, in which the optimization algorithm actively searches for small perturbations of the input that cause large changes to the output. These can be thought of as worst-case additive noise vectors. [Problem 9.3]
A second possibility is to add noise to the weights. This encourages the network to make sensible predictions even for small perturbations of the weights. The result is that the training converges to local minima in the middle of wide, flat regions, where changing the individual weights does not matter much.
Figure 9.10 Adding noise to inputs. At each step of SGD, random noise with variance σ_x² is added to the batch data. a–c) Fitted model with different noise levels (small dots represent ten samples). Adding more noise smooths out the fitted function (cyan line).

Finally, we can perturb the labels. The maximum-likelihood criterion for multiclass classification aims to predict the correct class with absolute certainty (equation 5.24). To this end, the final network activations (i.e., before the softmax function) are pushed to very large values for the correct class and very small values for the wrong classes. We could discourage this overconfident behavior by assuming that a proportion ρ of the training labels are incorrect and belong with equal probability to the other classes. This could be done by randomly changing the labels at each training iteration. However, the same end can be achieved by changing the loss function to minimize the cross-entropy between the predicted distribution and a distribution where the true label has probability 1 − ρ and the other classes have equal probability. This is known as label smoothing and improves generalization in diverse scenarios. [Problem 9.4]
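The smoothed target distribution is easy to construct directly; this sketch assumes integer class labels and is not code from the book's notebooks:

```python
import numpy as np

def smooth_labels(labels, n_classes, rho=0.1):
    """Label smoothing targets: the true class gets probability 1 - rho,
    and the remaining rho is spread evenly over the other classes."""
    targets = np.full((len(labels), n_classes), rho / (n_classes - 1))
    targets[np.arange(len(labels)), labels] = 1.0 - rho
    return targets

def cross_entropy(targets, log_probs):
    """Cross-entropy between target distribution and predicted log-probabilities."""
    return -np.mean(np.sum(targets * log_probs, axis=1))

targets = smooth_labels(np.array([0, 2]), n_classes=3, rho=0.1)
assert np.allclose(targets.sum(axis=1), 1.0)      # each row is a distribution
assert np.allclose(targets[0], [0.9, 0.05, 0.05]) # true class keeps 1 - rho
```

Minimizing `cross_entropy` against these targets has the same effect as randomly flipping a proportion ρ of the labels at each iteration, but without the extra randomness.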
9.3.5 Bayesian inference
The maximum likelihood approach is generally overconfident; in the training phase, it selects the most likely parameters and bases its predictions on the model defined by these. However, many parameter values may be broadly compatible with the data and only slightly less likely. The Bayesian approach treats the parameters as unknown variables [Appendix C.1.4: Bayes' rule] and computes a distribution Pr(ϕ|{x_i, y_i}) over these parameters ϕ conditioned on the training data {x_i, y_i} using Bayes' rule:
$$Pr(\phi|\{x_i, y_i\}) = \frac{\prod_{i=1}^{I} Pr(y_i|x_i, \phi)\, Pr(\phi)}{\int \prod_{i=1}^{I} Pr(y_i|x_i, \phi)\, Pr(\phi)\, d\phi}, \tag{9.11}$$
where Pr(ϕ) is the prior probability of the parameters, and the denominator is a normalizing term. Hence, every parameter choice is assigned a probability (figure 9.11). The prediction y for new input x is an infinite weighted sum (i.e., an integral) of the predictions for each parameter set, where the weights are the associated probabilities:
$$Pr(y|x, \{x_i, y_i\}) = \int Pr(y|x, \phi)\, Pr(\phi|\{x_i, y_i\})\, d\phi. \tag{9.12}$$
This is effectively an infinite weighted ensemble, where the weight depends on (i) the prior probability of the parameters and (ii) their agreement with the data.
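In practice, the integral of equation 9.12 can only be approximated, for example by a weighted average over sampled parameter settings. The following hypothetical sketch illustrates the idea with a toy one-parameter model (the samples and posterior weights are made up for illustration):

```python
import numpy as np

def mc_predictive(x, param_samples, weights, predict):
    """Monte Carlo approximation of equation 9.12: a weighted average of
    the predictions made by several parameter settings, with weights
    proportional to their posterior probabilities."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()              # normalize posterior weights
    preds = np.array([predict(x, phi) for phi in param_samples])
    return np.sum(weights[:, None] * preds, axis=0)

# Toy model: the prediction is a line with slope phi; three posterior samples.
predict = lambda x, phi: phi * x
x = np.array([1.0, 2.0])
samples = [0.9, 1.0, 1.1]
posterior = [0.25, 0.5, 0.25]

mean_pred = mc_predictive(x, samples, posterior, predict)
assert np.allclose(mean_pred, [1.0, 2.0])  # weighted mean slope is 1.0
```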
Figure 9.11 Bayesian approach for simplified network model (see figure 8.4). The parameters are treated as uncertain. The posterior probability Pr(ϕ|{x_i, y_i}) for a set of parameters is determined by their compatibility with the data {x_i, y_i} and a prior distribution Pr(ϕ). a–c) Two sets of parameters (cyan and gray curves) sampled from the posterior using normally distributed priors with mean zero and three variances. When the prior variance σ_ϕ² is small, the parameters also tend to be small, and the functions smoother. d–f) Inference proceeds by taking a weighted sum over all possible parameter values where the weights are the posterior probabilities. This produces both a prediction of the mean (cyan curves) and the associated uncertainty (gray region is two standard deviations).
The Bayesian approach is elegant and can provide more robust predictions than those that derive from maximum likelihood. Unfortunately, for complex models like neural networks, there is no practical way to represent the full probability distribution over the parameters or to integrate over it during the inference phase. Consequently, all current methods of this type make approximations of some kind, and typically these add considerable complexity to learning and inference. [Notebook 9.4: Bayesian approach]
9.3.6 Transfer learning and multi-task learning
When training data are limited, other datasets can be exploited to improve performance. In transfer learning (figure 9.12a), the network is pre-trained to perform a related secondary task for which data are more plentiful. The resulting model is then adapted to the original task. This is typically done by removing the last layer and adding one or more layers that produce a suitable output. The main model may be fixed, and the new layers trained for the original task, or we may fine-tune the entire model.
The principle is that the network will build a good internal representation of the data from the secondary task, which can subsequently be exploited for the original task. Equivalently, transfer learning can be viewed as initializing most of the parameters of the final network in a sensible part of the space that is likely to produce a good solution.

Multi-task learning (figure 9.12b) is a related technique in which the network is trained to solve several problems concurrently. For example, the network might take an image and simultaneously learn to segment the scene, estimate the pixel-wise depth, and predict a caption describing the image. All of these tasks require some understanding of the image and, when learned simultaneously, the model performance for each may improve.
9.3.7 Self-supervised learning
The above discussion assumes that we have plentiful data for a secondary task or data for multiple tasks to be learned concurrently. If not, we can create large amounts of "free" labeled data using self-supervised learning and use this for transfer learning. There are two families of methods for self-supervised learning: generative and contrastive.

In generative self-supervised learning, part of each data example is masked, and the secondary task is to predict the missing part (figure 9.12c). For example, we might use a corpus of unlabeled images and a secondary task that aims to inpaint (fill in) missing parts of the image (figure 9.12c). Similarly, we might use a large corpus of text and mask some words. We train the network to predict the missing words and then fine-tune it for the actual language task we are interested in (see chapter 12).

In contrastive self-supervised learning, pairs of examples with commonalities are compared to unrelated pairs. For images, the secondary task might be to identify whether a pair of images are transformed versions of one another or are unconnected. For text, the secondary task might be to determine whether two sentences followed one another in the original document. Sometimes, the precise relationship between a connected pair must be identified (e.g., finding the relative position of two patches from the same image).
9.3.8 Augmentation
Transfer learning improves performance by exploiting a different dataset. Multi-task learning improves performance using additional labels. A third option is to expand the dataset. We can often transform each input data example in such a way that the label stays the same. For example, we might aim to determine if there is a bird in an image (figure 9.13). Here, we could rotate, flip, blur, or manipulate the color balance of the image, and the label "bird" remains valid. Similarly, for tasks where the input is text, we can substitute synonyms or translate to another language and back again. For tasks where the input is audio, we can amplify or attenuate different frequency bands. [Notebook 9.5: Augmentation]
Figure 9.12 Transfer, multi-task, and self-supervised learning. a) Transfer learning is used when we have limited labeled data for the primary task (here depth estimation) but plentiful data for a secondary task (here segmentation). We train a model for the secondary task, remove the final layers, and replace them with new layers appropriate to the primary task. We then train only the new layers or fine-tune the entire network for the primary task. The network learns a good internal representation from the secondary task that is then exploited for the primary task. b) In multi-task learning, we train a model to perform multiple tasks simultaneously, hoping that performance on each will improve. c) In generative self-supervised learning, we remove part of the data and train the network to complete the missing information. Here, the task is to fill in (inpaint) a masked portion of the image. This permits transfer learning when no labels are available. Images from Cordts et al. (2016).
Figure 9.13 Data augmentation. For some problems, each data example can be transformed to augment the dataset. a) Original image. b–h) Various geometric and photometric transformations of this image. For image classification, all these images still have the same label, "bird." Adapted from Wu et al. (2015a).
Generating extra training data in this way is known as data augmentation. The aim is to teach the model to be indifferent to these irrelevant data transformations.
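A label-preserving augmentation pipeline might look like the following sketch; the specific transformations and parameter ranges are illustrative choices, not prescriptions from the book:

```python
import numpy as np

rng = np.random.default_rng(5)

def augment(image):
    """Label-preserving augmentations for image classification: random
    horizontal flip, small brightness change, and mild additive noise.
    The label (e.g., "bird") is unchanged by any of these."""
    out = image.copy()
    if rng.random() < 0.5:
        out = out[:, ::-1]                           # geometric: horizontal flip
    out = out * rng.uniform(0.8, 1.2)                # photometric: brightness
    out = out + rng.normal(0, 0.01, size=out.shape)  # mild pixel noise
    return np.clip(out, 0.0, 1.0)                    # keep valid pixel range

image = rng.random((8, 8))                            # hypothetical grayscale image
batch = np.stack([augment(image) for _ in range(4)])  # 4 extra "free" examples

assert batch.shape == (4, 8, 8)
```

Each call produces a different transformed copy, so the effective dataset size grows with no extra labeling cost.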
9.4 Summary
Explicit regularization involves adding an extra term to the loss function that changes the position of the minimum. The term can be interpreted as a prior probability over the parameters. Stochastic gradient descent with a finite step size does not neutrally descend to the minimum of the loss function. This bias can be interpreted as adding additional terms to the loss function, and this is known as implicit regularization.

There are also many heuristics for improving generalization, including early stopping, dropout, ensembling, the Bayesian approach, adding noise, transfer learning, multi-task learning, and data augmentation. There are four main principles behind these methods (figure 9.14). We can (i) encourage the function to be smoother (e.g., L2 regularization), (ii) increase the amount of data (e.g., data augmentation), (iii) combine models (e.g., ensembling), or (iv) search for wider minima (e.g., applying noise to network weights).
Figure 9.14 Regularization methods. The regularization methods discussed in this chapter aim to improve generalization by one of four mechanisms. Some methods aim to make the modeled function smoother. Other methods increase the effective amount of data. The third group of methods combines multiple models and hence mitigates against uncertainty in the fitting process. Finally, the fourth group of methods encourages the training process to converge to a wide minimum where small errors in the estimated parameters are less important (see also figure 20.11).
Another way to improve generalization is to choose the model architecture to suit the task. For example, in image segmentation, we can share parameters within the model, so we don't need to independently learn what a tree looks like at every image location. Chapters 10–13 consider architectural variations designed for different tasks.
Notes
An overview and taxonomy of regularization techniques in deep learning can be found in Kukačka et al. (2017). Notably missing from the discussion in this chapter is BatchNorm (Szegedy et al., 2016) and its variants, which are described in chapter 11.
Regularization: L2 regularization penalizes the sum of squares of the network weights. This encourages the output function to change slowly (i.e., become smoother) and is the most used regularization term. It is sometimes referred to as Frobenius norm regularization as it penalizes the Frobenius norms of the weight matrices. It is often also mistakenly referred to as "weight decay," although this is a separate technique devised by Hanson & Pratt (1988) in which the parameters ϕ are updated as:

$$\phi \leftarrow (1 - \lambda')\phi - \alpha\frac{\partial L}{\partial \phi}, \tag{9.13}$$
where, as usual, α is the learning rate, and L is the loss. This is identical to gradient descent, except that the weights are reduced by a factor of 1 − λ′ before the gradient update. For standard SGD, weight decay is equivalent to L2 regularization (equation 9.5) with coefficient λ = λ′/2α. However, for Adam, the learning rate α is different for each parameter, so L2 regularization and weight decay differ. Loshchilov & Hutter (2019) present AdamW, which modifies Adam to implement weight decay correctly and show that this improves performance. [Problem 9.5]
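The relationship between the two updates can be verified numerically. The sketch below implements the weight decay update of equation 9.13 and plain SGD on the L2-regularized loss of equation 9.5, and checks that they coincide exactly when λ = λ′/2α (the gradient values are arbitrary illustrative numbers):

```python
import numpy as np

def weight_decay_step(phi, grad, alpha, lam_prime):
    """Weight decay update of equation 9.13: shrink the parameters by a
    factor (1 - lambda') before taking the gradient step."""
    return (1.0 - lam_prime) * phi - alpha * grad

def l2_step(phi, grad, alpha, lam):
    """Plain SGD on the L2-regularized loss (equation 9.5): the penalty
    lambda * sum(phi^2) contributes 2 * lambda * phi to the gradient."""
    return phi - alpha * (grad + 2.0 * lam * phi)

phi = np.array([1.0, -2.0])
grad = np.array([0.3, 0.1])
alpha, lam_prime = 0.1, 0.01

# For vanilla SGD the two updates coincide when lambda = lambda' / (2 * alpha).
lam = lam_prime / (2.0 * alpha)
assert np.allclose(weight_decay_step(phi, grad, alpha, lam_prime),
                   l2_step(phi, grad, alpha, lam))
```

For Adam, where α effectively differs per parameter, no single λ makes the two updates agree, which is the motivation for AdamW's decoupled decay.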
Other choices of vector norm encourage sparsity in the weights. The L0 regularization term applies a fixed penalty for every non-zero weight. The effect is to "prune" the network. L0 regularization can also be used to encourage group sparsity; this might apply a fixed penalty if any of the weights contributing to a given hidden unit are non-zero. If they are all zero, we can remove the unit, decreasing the model size and making inference faster. [Appendix B.3.2: Vector norms]
Unfortunately, L0 regularization is challenging to implement since the derivative of the regularization term is not smooth, and more sophisticated fitting methods are required (see Louizos et al., 2018). Somewhere between L2 and L0 regularization is L1 regularization or LASSO (least absolute shrinkage and selection operator), which imposes a penalty on the absolute values of the weights. L2 regularization somewhat discourages sparsity in that the derivative of the squared penalty decreases as the weight becomes smaller, lowering the pressure to make it smaller still. L1 regularization does not have this disadvantage, as the derivative of the penalty is constant. This can produce sparser solutions than L2 regularization but is much easier to optimize than L0 regularization. Sometimes both L1 and L2 regularization terms are used, which is termed an elastic net penalty (Zou & Hastie, 2005). [Problem 9.6]
A different approach to regularization is to modify the gradients of the learning algorithm without ever explicitly formulating a new loss function (e.g., equation 9.13). This approach has been used to promote sparsity during backpropagation (Schwarz et al., 2021).
The evidence on the effectiveness of explicit regularization is mixed. Zhang et al. (2017a) showed that L2 regularization contributes little to generalization. It has been proven that the Lipschitz constant of the network (how fast the function can change as we modify the input) bounds the generalization error (Bartlett et al., 2017; Neyshabur et al., 2018). However, the Lipschitz constant depends on the product of the spectral norms of the weight matrices, which are only indirectly dependent on the magnitudes of the individual weights. Bartlett et al. (2017), Neyshabur et al. (2018), and Yoshida & Miyato (2017) all add terms that indirectly encourage the spectral norms to be smaller. Gouk et al. (2021) take a different approach and develop an algorithm that constrains the Lipschitz constant of the network to be below a particular value. [Appendix B.1.1: Lipschitz constant] [Appendix B.3.7: Spectral norm]
Implicit regularization in gradient descent: The gradient descent step is:

$$\phi_1 = \phi_0 + \alpha \cdot g[\phi_0], \tag{9.14}$$

where g[ϕ_0] is the negative of the gradient of the loss function, and α is the step size. As α → 0, the gradient descent process can be described by a differential equation:

$$\frac{d\phi}{dt} = g[\phi]. \tag{9.15}$$

For typical step sizes α, the discrete and continuous versions converge to different solutions. We can use backward error analysis to find a correction g_1[ϕ] to the continuous version:

$$\frac{d\phi}{dt} = g[\phi] + \alpha g_1[\phi] + \ldots, \tag{9.16}$$

so that it gives the same result as the discrete version.

Consider the first two terms of a Taylor expansion of the modified continuous solution ϕ around initial position ϕ_0:
$$\begin{aligned}
\phi[\alpha] &\approx \phi + \alpha\frac{d\phi}{dt} + \frac{\alpha^2}{2}\frac{d^2\phi}{dt^2}\bigg|_{\phi=\phi_0} \\
&\approx \phi + \alpha\left(g[\phi] + \alpha g_1[\phi]\right) + \frac{\alpha^2}{2}\left(\frac{\partial g[\phi]}{\partial \phi}\frac{d\phi}{dt} + \alpha\frac{\partial g_1[\phi]}{\partial \phi}\frac{d\phi}{dt}\right)\bigg|_{\phi=\phi_0} \\
&= \phi + \alpha\left(g[\phi] + \alpha g_1[\phi]\right) + \frac{\alpha^2}{2}\left(\frac{\partial g[\phi]}{\partial \phi}g[\phi] + \alpha\frac{\partial g_1[\phi]}{\partial \phi}g[\phi]\right)\bigg|_{\phi=\phi_0} \\
&\approx \phi + \alpha g[\phi] + \alpha^2\left(g_1[\phi] + \frac{1}{2}\frac{\partial g[\phi]}{\partial \phi}g[\phi]\right)\bigg|_{\phi=\phi_0},
\end{aligned} \tag{9.17}$$
where in the second line, we have introduced the correction term (equation 9.16), and in the final line, we have removed terms of greater order than α².
Note that the first two terms on the right-hand side $\phi_0 + \alpha g[\phi_0]$ are the same as the discrete update (equation 9.14). Hence, to make the continuous and discrete versions arrive at the same place, the third term on the right-hand side must equal zero, allowing us to solve for $g_1[\phi]$:

$$g_1[\phi] = -\frac{1}{2}\frac{\partial g[\phi]}{\partial \phi}g[\phi]. \qquad (9.18)$$
During training, the evolution function $g[\phi]$ is the negative of the gradient of the loss:

$$\frac{d\phi}{dt} \approx g[\phi] + \alpha g_1[\phi] = -\frac{\partial L}{\partial \phi} - \frac{\alpha}{2}\frac{\partial^2 L}{\partial \phi^2}\frac{\partial L}{\partial \phi}. \qquad (9.19)$$
This is equivalent to performing continuous gradient descent on the loss function:

$$L_{GD}[\phi] = L[\phi] + \frac{\alpha}{4}\left\|\frac{\partial L}{\partial \phi}\right\|^2, \qquad (9.20)$$
because the right-hand side of equation 9.19 is the negative of the derivative of the loss in equation 9.20.
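The correspondence between equations 9.19 and 9.20 can be checked numerically. The sketch below (not from the book's notebooks) uses a quadratic loss, for which the Hessian is constant, and compares a finite-difference gradient of the modified loss $L_{GD}$ with the analytic expression $\partial L/\partial\phi + (\alpha/2)\,\partial^2 L/\partial\phi^2\,\partial L/\partial\phi$:

```python
import numpy as np

# For L(phi) = 0.5 phi^T A phi, the gradient of the modified loss
# L_GD = L + (alpha/4) ||dL/dphi||^2 should equal
# dL/dphi + (alpha/2) (d2L/dphi2)(dL/dphi).
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3)); A = A @ A.T   # symmetric PSD Hessian
phi = rng.standard_normal(3)
alpha = 0.1

def loss(p):          return 0.5 * p @ A @ p
def grad(p):          return A @ p
def modified_loss(p): return loss(p) + (alpha / 4) * np.sum(grad(p) ** 2)

# Finite-difference gradient of the modified loss
eps = 1e-6
fd = np.array([(modified_loss(phi + eps * e) - modified_loss(phi - eps * e)) / (2 * eps)
               for e in np.eye(3)])

analytic = grad(phi) + (alpha / 2) * A @ grad(phi)   # Hessian of quadratic is A
print(np.max(np.abs(fd - analytic)))                 # tiny: the two expressions agree
```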
This formulation of implicit regularization was developed by Barrett & Dherin (2021) and
extended to stochastic gradient descent by Smith et al. (2021). Smith et al. (2020) and others
have shown that stochastic gradient descent with small or moderate batch sizes outperforms full
batch gradient descent on the test set, and this may in part be due to implicit regularization.
Relatedly, Jastrzębski et al. (2021) and Cohen et al. (2021) both show that using a large learning rate reduces the tendency of typical optimization trajectories to move to "sharper" parts of the loss function (i.e., where at least one direction has high curvature). This implicit regularization effect of large learning rates can be approximated by penalizing the trace of the Fisher Information Matrix, which is closely related to penalizing the gradient norm in equation 9.20 (Jastrzębski et al., 2021).
Early stopping: Bishop (1995) and Sjöberg & Ljung (1995) argued that early stopping limits the effective solution space that the training procedure can explore; given that the weights are initialized to small values, this leads to the idea that early stopping helps prevent the weights from getting too large. Goodfellow et al. (2016) show that under a quadratic approximation of the loss function with parameters initialized to zero, early stopping is equivalent to L2 regularization in gradient descent. The effective regularization weight $\lambda$ is approximately $1/(\tau\alpha)$, where $\alpha$ is the learning rate, and $\tau$ is the early stopping time.
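This approximate equivalence can be illustrated with a toy example. The sketch below (illustrative values, not from the book) runs early-stopped gradient descent on a 1D least-squares problem from a zero initialization and compares the result to the closed-form ridge solution with $\lambda = 1/(\tau\alpha)$:

```python
import numpy as np

# Early-stopped GD vs. L2 regularization with lambda = 1/(tau * alpha).
# The agreement is approximate and improves as alpha * tau shrinks.
rng = np.random.default_rng(1)
x = rng.standard_normal(20); x /= np.linalg.norm(x)   # normalize so sum(x^2) = 1
y = 2.0 * x + 0.1 * rng.standard_normal(20)

alpha, tau = 0.001, 100
phi = 0.0                                             # initialize at zero
for _ in range(tau):                                  # early-stopped gradient descent
    phi -= alpha * np.sum((phi * x - y) * x)

lam = 1.0 / (tau * alpha)                             # effective L2 weight
phi_ridge = np.sum(x * y) / (np.sum(x ** 2) + lam)    # closed-form ridge solution

print(phi, phi_ridge)                                 # close agreement
```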
Draft: please send errata to udlbookmail@gmail.com.
Ensembling: Ensembles can be trained using different random seeds (Lakshminarayanan et al., 2017), hyperparameters (Wenzel et al., 2020b), or even entirely different families of models. The models can be combined by averaging their predictions, weighting the predictions, or stacking (Wolpert, 1992), in which the results are combined using another machine learning model. Lakshminarayanan et al. (2017) showed that averaging the output of independently trained networks can improve accuracy, calibration, and robustness. Conversely, Frankle et al. (2020) showed that if we average together the weights to make one model, the network fails.

Fort et al. (2019) compared ensembling solutions that resulted from different initializations with ensembling solutions that were generated from the same original model. For example, in the latter case, they consider exploring around the solution in a limited subspace to find other good nearby points. They found that both techniques provide complementary benefits but that genuine ensembling from different random starting points provides a bigger improvement. [Appendix B.3.6: Subspaces]
An efficient way of ensembling is to combine models from intermediate stages of training. To this end, Izmailov et al. (2018) introduce stochastic weight averaging, in which the model weights are sampled at different time steps and averaged together. As the name suggests, snapshot ensembles (Huang et al., 2017a) also store the models from different time steps and average their predictions. The diversity of these models can be improved by cyclically increasing and decreasing the learning rate. Garipov et al. (2018) observed that different minima of the loss function are often connected by a low-energy path (i.e., a path with a low loss everywhere along it). Motivated by this observation, they developed a method that explores low-energy regions around an initial solution to provide diverse models without retraining. This is known as fast geometric ensembling. A review of ensembling methods can be found in Ganaie et al. (2022).
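A minimal sketch of ensembling by prediction averaging, using toy polynomial regressors in place of networks (all values illustrative): by convexity of the squared error, the averaged prediction can never have higher mean squared error than the average of the members' errors.

```python
import numpy as np

rng = np.random.default_rng(2)
x_test = np.linspace(-1, 1, 50)
y_test = np.sin(3 * x_test)

members = []
for _ in range(5):                                   # different random training draws
    x_tr = rng.uniform(-1, 1, 15)
    y_tr = np.sin(3 * x_tr) + 0.3 * rng.standard_normal(15)
    coeffs = np.polyfit(x_tr, y_tr, deg=3)           # one "model" per draw
    members.append(np.polyval(coeffs, x_test))

individual_mse = np.mean([(p - y_test) ** 2 for p in members], axis=1)
ensemble_mse = np.mean((np.mean(members, axis=0) - y_test) ** 2)
print(ensemble_mse, individual_mse.mean())   # ensemble <= average member error
```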
Dropout: Dropout was first introduced by Hinton et al. (2012b) and Srivastava et al. (2014). Dropout is applied at the level of hidden units. Dropping a hidden unit has the same effect as temporarily setting all the incoming and outgoing weights and the bias to zero. Wan et al. (2013) generalized dropout by randomly setting individual weights to zero. Gal & Ghahramani (2016) and Kendall & Gal (2017) proposed Monte Carlo dropout, in which inference is computed with several dropout patterns, and the results are averaged together. Gal & Ghahramani (2016) argued that this could be interpreted as approximating Bayesian inference.

Dropout is equivalent to applying multiplicative Bernoulli noise to the hidden units. Similar benefits derive from using other distributions, including the normal (Srivastava et al., 2014; Shen et al., 2017), uniform (Shen et al., 2017), and beta distributions (Liu et al., 2019b).
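A minimal sketch of dropout as multiplicative Bernoulli noise, using the common "inverted dropout" rescaling so that the expected activation is unchanged (the rate and layer shape below are illustrative):

```python
import numpy as np

def dropout(h, rate, rng, training=True):
    if not training:
        return h                              # no noise at test time
    mask = rng.random(h.shape) > rate         # Bernoulli keep-mask
    return h * mask / (1.0 - rate)            # rescale to preserve expectation

rng = np.random.default_rng(3)
h = np.ones((10000, 4))
out = dropout(h, rate=0.5, rng=rng)
print(out.mean())                             # close to 1.0: expectation preserved
```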
Adding noise: Bishop (1995) and An (1996) added Gaussian noise to the network inputs to
improve performance. Bishop (1995) showed that this is equivalent to weight decay. An (1996)
also investigated adding noise to the weights. DeVries & Taylor (2017a) added Gaussian noise
to the hidden units. The randomized ReLU (Xu et al., 2015) applies noise in a different way by making the activation functions stochastic.
Label smoothing: Label smoothing was introduced by Szegedy et al. (2016) for image classification but has since been shown to be helpful in speech recognition (Chorowski & Jaitly, 2017),
machine translation (Vaswani et al., 2017), and language modeling (Pereyra et al., 2017). The
precise mechanism by which label smoothing improves test performance isn’t well understood,
although Müller et al. (2019a) show that it improves the calibration of the predicted output
probabilities. A closely related technique is DisturbLabel (Xie et al., 2016), in which a certain
percentage of the labels in each batch are randomly switched at each training iteration.
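A minimal sketch of constructing smoothed targets with 0.9 at the correct class and the remaining 0.1 spread evenly over the other classes (cf. problem 9.4; the class counts and probabilities below are illustrative):

```python
import numpy as np

def smoothed_targets(true_class, num_classes, eps=0.1):
    # Replace the one-hot target with 1-eps at the true class and
    # eps/(num_classes-1) at every other class.
    t = np.full(num_classes, eps / (num_classes - 1))
    t[true_class] = 1.0 - eps
    return t

def cross_entropy(probs, targets):
    return -np.sum(targets * np.log(probs))

t = smoothed_targets(true_class=2, num_classes=5)
print(t)                                           # 0.9 at class 2, 0.025 elsewhere
probs = np.array([0.05, 0.05, 0.8, 0.05, 0.05])
print(cross_entropy(probs, t))
```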
Finding wider minima: It is thought that wider minima generalize better (see figure 20.11). Here, the exact values of the weights are less important, so performance should be robust to errors in their estimates. One of the reasons that applying noise to parts of the network during training is effective is that it encourages the network to be indifferent to their exact values. Chaudhari et al. (2019) developed a variant of SGD that biases the optimization toward flat minima, which they call entropy SGD. The idea is to incorporate local entropy as a term in the loss function. In practice, this takes the form of one SGD-like update within another. Keskar et al. (2017) showed that SGD finds wider minima as the batch size is reduced. This may be because of the batch variance term that results from implicit regularization by SGD.

Ishida et al. (2020) use a technique named flooding, in which they intentionally prevent the training loss from becoming zero. This encourages the solution to perform a random walk over the loss landscape and drift into a flatter area with better generalization.
Bayesian approaches: For some models, including the simplified neural network model in figure 9.11, the Bayesian predictive distribution can be computed in closed form (see Bishop, 2006; Prince, 2012). For neural networks, the posterior distribution over the parameters cannot be represented in closed form and must be approximated. The two main approaches are variational Bayes (Hinton & van Camp, 1993; MacKay, 1995; Barber & Bishop, 1997; Blundell et al., 2015), in which the posterior is approximated by a simpler tractable distribution, and Markov Chain Monte Carlo (MCMC) methods, which approximate the distribution by drawing a set of samples (Neal, 1995; Welling & Teh, 2011; Chen et al., 2014; Ma et al., 2015; Li et al., 2016a). The generation of samples can be integrated into SGD, and this is known as stochastic gradient MCMC (see Ma et al., 2015). It has recently been discovered that "cooling" the posterior distribution over the parameters (making it sharper) improves predictions from these models (Wenzel et al., 2020a), but this is not currently fully understood (see Noci et al., 2021).
Transfer learning: Transfer learning for visual tasks works extremely well (Sharif Razavian
et al., 2014) and has supported rapid progress in computer vision, including the original AlexNet
results (Krizhevsky et al., 2012). Transfer learning has also impacted natural language processing (NLP), where many models are based on pre-trained features from the BERT model (Devlin
et al., 2019). More information can be found in Zhuang et al. (2020) and Yang et al. (2020b).
Self-supervised learning: Self-supervised learning techniques for images have included inpainting masked image regions (Pathak et al., 2016), predicting the relative position of patches in an image (Doersch et al., 2015), re-arranging permuted image tiles back into their original configuration (Noroozi & Favaro, 2016), colorizing grayscale images (Zhang et al., 2016b), and transforming rotated images back to their original orientation (Gidaris et al., 2018). In SimCLR (Chen et al., 2020c), a network is learned that maps versions of the same image that have been photometrically and geometrically transformed to the same representation while repelling versions of different images, with the goal of becoming indifferent to irrelevant image transformations. Jing & Tian (2020) present a survey of self-supervised learning in images.

Self-supervised learning in NLP can be based on predicting masked words (Devlin et al., 2019), predicting the next word in a sentence (Radford et al., 2019; Brown et al., 2020), or predicting whether two sentences follow one another (Devlin et al., 2019). In automatic speech recognition, the Wav2Vec model (Schneider et al., 2019) aims to distinguish an original audio sample from one where 10 ms of audio has been swapped out from elsewhere in the clip. Self-supervision has also been applied to graph neural networks (chapter 13). Tasks include recovering masked features (You et al., 2020) and recovering the adjacency structure of the graph (Kipf & Welling, 2016). Liu et al. (2023a) review self-supervised learning for graph models.
Data augmentation: Data augmentation for images dates back to at least LeCun et al. (1998) and contributed to the success of AlexNet (Krizhevsky et al., 2012), in which the dataset was increased by a factor of 2048. Image augmentation approaches include geometric transformations, changing or manipulating the color space, noise injection, and applying spatial filters. More elaborate techniques include randomly mixing images (Inoue, 2018; Summers & Dinneen, 2019), randomly erasing parts of the image (Zhong et al., 2020), style transfer (Jackson et al., 2019), and randomly swapping image patches (Kang et al., 2017). In addition, many studies have used generative adversarial networks or GANs (see chapter 15) to produce novel but plausible data examples (e.g., Calimeri et al., 2017). In other cases, the data have been augmented with adversarial examples (Goodfellow et al., 2015a), which are minor perturbations of the training data that cause the example to be misclassified. A review of data augmentation for images can be found in Shorten & Khoshgoftaar (2019).
Augmentation methods for acoustic data include pitch shifting, time stretching, dynamic range
compression, and adding random noise (e.g., Abeßer et al., 2017; Salamon & Bello, 2017; Xu
et al., 2015; Lasseck, 2018), as well as mixing data pairs (Zhang et al., 2017c; Yun et al., 2019),
masking features (Park et al., 2019), and using GANs to generate new data (Mun et al., 2017).
Augmentation for speech data includes vocal tract length perturbation (Jaitly & Hinton, 2013;
Kanda et al., 2013), style transfer (Gales, 1998; Ye & Young, 2004), adding noise (Hannun et al.,
2014), and synthesizing speech (Gales et al., 2009).
Augmentation methods for text include adding noise at a character level by switching, deleting,
and inserting letters (Belinkov & Bisk, 2018; Feng et al., 2020), or by generating adversarial
examples (Ebrahimi et al., 2018), using common spelling mistakes (Coulombe, 2018), randomly
swapping or deleting words (Wei & Zou, 2019), using synonyms (Kolomiyets et al., 2011),
altering adjectives (Li et al., 2017c), passivization (Min et al., 2020), using generative models
to create new data (Qiu et al., 2020), and round-trip translation to another language and back
(Aiken & Park, 2010). Augmentation methods for text are reviewed by Bayer et al. (2022).
Problems
Problem 9.1 Consider a model where the prior distribution over the parameters is a normal distribution with mean zero and variance $\sigma_\phi^2$ so that

$$\Pr(\phi) = \prod_{j=1}^{J} \text{Norm}_{\phi_j}\bigl[0, \sigma_\phi^2\bigr], \qquad (9.21)$$

where $j$ indexes the model parameters. We now maximize $\prod_{i=1}^{I}\Pr(y_i|x_i,\phi)\Pr(\phi)$. Show that the associated loss function of this model is equivalent to L2 regularization.
Problem 9.2 How do the gradients of the loss function change when L2 regularization (equation 9.5) is added?
Problem 9.3 Consider a linear regression model $y = \phi_0 + \phi_1 x$ with input $x$, output $y$, and parameters $\phi_0$ and $\phi_1$. Assume we have $I$ training examples $\{x_i, y_i\}$ and use a least squares loss. Consider adding Gaussian noise with mean zero and variance $\sigma_x^2$ to the inputs $x_i$ at each training iteration. What is the expected gradient update?
Problem 9.4 Derive the loss function for multiclass classification when we use label smoothing so that the target probability distribution has 0.9 at the correct class and the remaining probability mass of 0.1 is divided between the remaining $D_o - 1$ classes.
Problem 9.5 Show that the weight decay parameter update with decay rate $\lambda$:

$$\phi \leftarrow (1-\lambda)\phi - \alpha\frac{\partial L}{\partial \phi}, \qquad (9.22)$$

on the original loss function $L[\phi]$ is equivalent to a standard gradient update using L2 regularization so that the modified loss function $\tilde{L}[\phi]$ is:

$$\tilde{L}[\phi] = L[\phi] + \frac{\lambda}{2\alpha}\sum_k \phi_k^2, \qquad (9.23)$$

where $\phi$ are the parameters, and $\alpha$ is the learning rate.
Problem 9.6 Consider a model with parameters $\phi = [\phi_0, \phi_1]^T$. Draw the L0, L$\frac{1}{2}$, and L1 regularization terms in a similar form to figure 9.1b. The LP regularization term is $\sum_{d=1}^{D}|\phi_d|^P$.
Chapter 10
Convolutional networks
Chapters 2–9 introduced the supervised learning pipeline for deep neural networks. However, these chapters only considered fully connected networks with a single path from input to output. Chapters 10–13 introduce more specialized network components with sparser connections, shared weights, and parallel processing paths. This chapter describes convolutional layers, which are mainly used for processing image data.
Images have three properties that suggest the need for specialized model architecture. First, they are high-dimensional. A typical image for a classification task contains 224×224 RGB values (i.e., 150,528 input dimensions). Hidden layers in fully connected networks are generally larger than the input size, so even for a shallow network, the number of weights would exceed 150,528², or 22 billion. This poses obvious practical problems in terms of the required training data, memory, and computation.
Second, nearby image pixels are statistically related. However, fully connected networks have no notion of "nearby" and treat the relationship between every input equally. If the pixels of the training and test images were randomly permuted in the same way, the network could still be trained with no practical difference. Third, the interpretation of an image is stable under geometric transformations. An image of a tree is still an image of a tree if we shift it leftwards by a few pixels. However, this shift changes every input to the network. Hence, a fully connected model must learn the patterns of pixels that signify a tree separately at every position, which is clearly inefficient.
Convolutional layers process each local image region independently, using parameters shared across the whole image. They use fewer parameters than fully connected layers, exploit the spatial relationships between nearby pixels, and don't have to re-learn the interpretation of the pixels at every position. A network predominantly consisting of convolutional layers is known as a convolutional neural network or CNN.
10.1 Invariance and equivariance
We argued above that some properties of images (e.g., tree texture) are stable under transformations. In this section, we make this idea more mathematically precise.

Figure 10.1 Invariance and equivariance for translation. a–b) In image classification, the goal is to categorize both images as "mountain" regardless of the horizontal shift that has occurred. In other words, we require the network prediction to be invariant to translation. c,e) The goal of semantic segmentation is to associate a label with each pixel. d,f) When the input image is translated, we want the output (colored overlay) to translate in the same way. In other words, we require the output to be equivariant with respect to translation. Panels c–f) adapted from Bousselham et al. (2021).

A function f[x] of an image x is invariant to a transformation t[x] if:

$$f\bigl[t[x]\bigr] = f[x]. \qquad (10.1)$$
In other words, the output of the function f[x] is the same regardless of the transformation t[x]. Networks for image classification should be invariant to geometric transformations of the image (figure 10.1a–b). The network f[x] should identify an image as containing the same object, even if it has been translated, rotated, flipped, or warped.

A function f[x] of an image x is equivariant or covariant to a transformation t[x] if:

$$f\bigl[t[x]\bigr] = t\bigl[f[x]\bigr]. \qquad (10.2)$$

In other words, f[x] is equivariant to the transformation t[x] if its output changes in the same way under the transformation as the input. Networks for per-pixel image segmentation should be equivariant to transformations (figure 10.1c–f); if the image is translated, rotated, or flipped, the network f[x] should return a segmentation that has been transformed in the same way.
Figure 10.2 1D convolution with kernel size three. Each output $z_i$ is a weighted sum of the nearest three inputs $x_{i-1}$, $x_i$, and $x_{i+1}$, where the weights are $\omega = [\omega_1, \omega_2, \omega_3]$. a) Output $z_2$ is computed as $z_2 = \omega_1 x_1 + \omega_2 x_2 + \omega_3 x_3$. b) Output $z_3$ is computed as $z_3 = \omega_1 x_2 + \omega_2 x_3 + \omega_3 x_4$. c) At position $z_1$, the kernel extends beyond the first input $x_1$. This can be handled by zero padding, in which we assume values outside the input are zero. The final output is treated similarly. d) Alternatively, we could only compute outputs where the kernel fits within the input range ("valid" convolution); now, the output will be smaller than the input.
10.2 Convolutional networks for 1D inputs

Convolutional networks consist of a series of convolutional layers, each of which is equivariant to translation. They also typically include pooling mechanisms that induce partial invariance to translation. For clarity of exposition, we first consider convolutional networks for 1D data, which are easier to visualize. In section 10.3, we progress to 2D convolution, which can be applied to image data.
10.2.1 1D convolution operation

Convolutional layers are network layers based on the convolution operation. In 1D, a convolution transforms an input vector x into an output vector z so that each output $z_i$ is a weighted sum of nearby inputs. The same weights are used at every position and are collectively called the convolution kernel or filter. The size of the region over which inputs are combined is termed the kernel size. For a kernel size of three, we have:

$$z_i = \omega_1 x_{i-1} + \omega_2 x_i + \omega_3 x_{i+1}, \qquad (10.3)$$

where $\omega = [\omega_1, \omega_2, \omega_3]^T$ is the kernel (figure 10.2).¹ Notice that the convolution operation is equivariant with respect to translation. If we translate the input x, then the corresponding output z is translated in the same way. [Problem 10.1]

¹Strictly speaking, this is a cross-correlation and not a convolution, in which the weights would be flipped relative to the input (so we would switch $x_{i-1}$ with $x_{i+1}$). Regardless, this (incorrect) definition is the usual convention in machine learning.
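Equation 10.3 with zero padding can be sketched in a few lines of NumPy (the kernel values are illustrative; note that np.convolve would flip the kernel, whereas this follows the machine learning convention):

```python
import numpy as np

def conv1d(x, omega):
    # 1D "convolution" (cross-correlation) with kernel size three, zero padding
    x_pad = np.concatenate([[0.0], x, [0.0]])
    return np.array([omega @ x_pad[i:i + 3] for i in range(len(x))])

x = np.array([1.0, 2.0, 3.0, 4.0])
omega = np.array([-1.0, 0.0, 1.0])
z = conv1d(x, omega)
print(z)   # z_i = -x_{i-1} + x_{i+1}: central differences [2, 2, 2, -3]
```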
Figure 10.3 Stride, kernel size, and dilation. a) With a stride of two, we evaluate the kernel at every other position, so the first output $z_1$ is computed from a weighted sum centered at $x_1$, and b) the second output $z_2$ is computed from a weighted sum centered at $x_3$, and so on. c) The kernel size can also be changed. With a kernel size of five, we take a weighted sum of the nearest five inputs. d) In dilated or atrous convolution (from the French "à trous," meaning "with holes"), we intersperse zeros in the weight vector to allow us to combine information over a large area using fewer weights.
10.2.2 Padding

Equation 10.3 shows that each output is computed by taking a weighted sum of the previous, current, and subsequent positions in the input. This begs the question of how to deal with the first output (where there is no previous input) and the final output (where there is no subsequent input).

There are two common approaches. The first is to pad the edges of the inputs with new values and proceed as usual. Zero padding assumes the input is zero outside its valid range (figure 10.2c). Other possibilities include treating the input as circular or reflecting it at the boundaries. The second approach is to discard the output positions where the kernel exceeds the range of input positions. These valid convolutions have the advantage of introducing no extra information at the edges of the input. However, they have the disadvantage that the representation decreases in size.
10.2.3 Stride, kernel size, and dilation
In the example above, each output was a sum of the nearest three inputs. However,
this is just one of a larger family of convolution operations, the members of which are
This work is subject to a Creative Commons CC-BY-NC-ND license. (C) MIT Press.
10.2 Convolutional networks for 1D inputs 165
distinguished by their stride, kernel size, and dilation rate. When we evaluate the output
at every position, we term this a stride of one. However, it is also possible to shift the
kernel by a stride greater than one. If we have a stride of two, we create roughly half
the number of outputs (gure 10.3a–b).
The kernel size can be increased to integrate over a larger area (gure 10.3c). How-
ever, it typically remains an odd number so that it can be centered around the current
position. Increasing the kernel size has the disadvantage of requiring more weights. This
leads to the idea of dilated or atrous convolutions, in which the kernel values are inter-
spersed with zeros. For example, we can turn a kernel of size ve into a dilated kernel of
size three by setting the second and fourth elements to zero. We still integrate informa-
Problems 10.2–10.4
tion from a larger input region but only require three weights to do this (gure 10.3d).
The number of zeros we intersperse between the weights determines the dilation rate.
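The stride and dilation variants can be sketched by generalizing the zero-padded operation of equation 10.3 (kernel weights illustrative):

```python
import numpy as np

def conv1d(x, omega, stride=1, dilation=1):
    # Kernel size three; kernel taps sit at offsets -d, 0, +d from each center.
    d = dilation
    x_pad = np.concatenate([np.zeros(d), x, np.zeros(d)])
    centers = range(0, len(x), stride)              # stride: skip positions
    return np.array([omega[0] * x_pad[i] + omega[1] * x_pad[i + d] +
                     omega[2] * x_pad[i + 2 * d] for i in centers])

x = np.arange(1.0, 7.0)                             # [1, 2, 3, 4, 5, 6]
omega = np.array([1.0, 1.0, 1.0])
print(conv1d(x, omega, stride=2))    # half as many outputs: [3, 9, 15]
print(conv1d(x, omega, dilation=2))  # each output spans five input positions
```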
10.2.4 Convolutional layers

A convolutional layer computes its output by convolving the input, adding a bias $\beta$, and passing each result through an activation function a[•]. With kernel size three, stride one, and dilation rate one, the i-th hidden unit $h_i$ would be computed as:

$$h_i = a\Bigl[\beta + \omega_1 x_{i-1} + \omega_2 x_i + \omega_3 x_{i+1}\Bigr] = a\Bigl[\beta + \sum_{j=1}^{3}\omega_j x_{i+j-2}\Bigr], \qquad (10.4)$$

where the bias $\beta$ and kernel weights $\omega_1, \omega_2, \omega_3$ are trainable parameters, and (with zero padding) we treat the input x as zero when it is out of the valid range. This is a special case of a fully connected layer that computes the i-th hidden unit as:

$$h_i = a\Bigl[\beta_i + \sum_{j=1}^{D}\omega_{ij} x_j\Bigr]. \qquad (10.5)$$

If there are D inputs $x \in \mathbb{R}^D$ and D hidden units $h \in \mathbb{R}^D$, this fully connected layer would have $D^2$ weights and D biases. The convolutional layer only uses three weights and one bias. A fully connected layer can reproduce this exactly if most weights are set to zero and others are constrained to be identical (figure 10.4). [Problem 10.5]
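This correspondence (illustrated in figure 10.4) can be verified directly: building the banded weight matrix with shared weights reproduces the zero-padded convolution exactly (weights illustrative):

```python
import numpy as np

D = 6
omega = np.array([0.5, -1.0, 2.0])

# Fully connected weight matrix: same three weights repeated along the band,
# zeros everywhere else (figure 10.4d).
W = np.zeros((D, D))
for i in range(D):
    for j, w in zip(range(i - 1, i + 2), omega):   # offsets -1, 0, +1
        if 0 <= j < D:
            W[i, j] = w

x = np.arange(1.0, D + 1)
x_pad = np.concatenate([[0.0], x, [0.0]])          # zero padding
z_conv = np.array([omega @ x_pad[i:i + 3] for i in range(D)])
print(np.allclose(W @ x, z_conv))                  # True: identical outputs
```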
10.2.5 Channels
If we only apply a single convolution, information will inevitably be lost; we are averaging
nearby inputs, and the ReLU activation function clips results that are less than zero.
Hence, it is usual to compute several convolutions in parallel. Each convolution produces
a new set of hidden variables, termed a feature map or channel.
Figure 10.4 Fully connected vs. convolutional layers. a) A fully connected layer
has a weight connecting each input x to each hidden unit h (colored arrows)
and a bias for each hidden unit (not shown). b) Hence, the associated weight
matrix contains 36 weights relating the six inputs to the six hidden units. c) A
convolutional layer with kernel size three computes each hidden unit as the same
weighted sum of the three neighboring inputs (arrows) plus a bias (not shown).
d) The weight matrix is a special case of the fully connected matrix where many
weights are zero and others are repeated (same colors indicate same value, white
indicates zero weight). e) A convolutional layer with kernel size three and stride
two computes a weighted sum at every other position. f) This is also a special case of a fully connected network with a different sparse weight structure.
Figure 10.5 Channels. Typically, multiple convolutions are applied to the input x and stored in channels. a) A convolution is applied to create hidden units $h_1$ to $h_6$, which form the first channel. b) A second convolution operation is applied to create hidden units $h_7$ to $h_{12}$, which form the second channel. The channels are stored in a 2D array $H_1$ that contains all the hidden units in the first hidden layer. c) If we add a further convolutional layer, there are now two channels at each input position. Here, the 1D convolution defines a weighted sum over both input channels at the three closest positions to create each new output channel.
Figure 10.5a–b illustrates this with two convolution kernels of size three and with zero padding. The first kernel computes a weighted sum of the nearest three pixels, adds a bias, and passes the results through the activation function to produce hidden units $h_1$ to $h_6$. These comprise the first channel. The second kernel computes a different weighted sum of the nearest three pixels, adds a different bias, and passes the results through the activation function to create hidden units $h_7$ to $h_{12}$. These comprise the second channel.
In general, the input and the hidden layers all have multiple channels (figure 10.5c). If the incoming layer has $C_i$ channels and kernel size K, the hidden units in each output channel are computed as a weighted sum over all $C_i$ channels and K kernel positions using a weight matrix $\Omega \in \mathbb{R}^{C_i \times K}$ and one bias. Hence, if there are $C_o$ channels in the next layer, then we need $\Omega \in \mathbb{R}^{C_i \times C_o \times K}$ weights and $\beta \in \mathbb{R}^{C_o}$ biases. [Problems 10.6–10.8] [Notebook 10.1: 1D convolution]
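A sketch of a multi-channel 1D convolutional layer with these weight shapes (random illustrative values; the einsum-based implementation is one of several equivalent choices):

```python
import numpy as np

# C_i input channels, C_o output channels, kernel size K, zero padding.
rng = np.random.default_rng(4)
C_i, C_o, K, D = 2, 3, 3, 6
Omega = rng.standard_normal((C_i, C_o, K))        # weights, shape (C_i, C_o, K)
beta = rng.standard_normal(C_o)                   # one bias per output channel

X = rng.standard_normal((C_i, D))                 # input: channels x positions
X_pad = np.pad(X, ((0, 0), (1, 1)))               # zero-pad the position axis
windows = np.stack([X_pad[:, i:i + K] for i in range(D)], axis=1)  # (C_i, D, K)
Z = np.einsum('cdk,cok->od', windows, Omega) + beta[:, None]       # (C_o, D)
print(Z.shape)   # (3, 6): C_o channels at each of D positions
```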
10.2.6 Convolutional networks and receptive fields

Chapter 4 described deep networks, which consisted of a sequence of fully connected layers. Similarly, convolutional networks comprise a sequence of convolutional layers. The receptive field of a hidden unit in the network is the region of the original input that feeds into it. Consider a convolutional network where each convolutional layer has kernel size three. The hidden units in the first layer take a weighted sum of the three closest inputs, so have receptive fields of size three. The units in the second layer take a weighted sum of the three closest positions in the first layer, which are themselves weighted sums of three inputs. Hence, the hidden units in the second layer have a receptive field of size five. In this way, the receptive field of units in successive layers increases, and information from across the input is gradually integrated (figure 10.6). [Problems 10.9–10.11]
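The growth of the receptive field can be computed for a stack of layers: each layer widens it by (kernel size − 1) times the product of the preceding strides (a standard formula, not from the book):

```python
def receptive_field(kernels, strides):
    # rf: receptive field size; jump: input spacing between adjacent units
    rf, jump = 1, 1
    for k, s in zip(kernels, strides):
        rf += (k - 1) * jump        # widen by k-1 steps at the current spacing
        jump *= s                   # spacing between units after this layer
    return rf

# Kernel size three, stride one everywhere: grows by two per layer
print([receptive_field([3] * n, [1] * n) for n in (1, 2, 3)])   # [3, 5, 7]
```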
10.2.7 Example: MNIST-1D

We now apply a convolutional network to the MNIST-1D data (see figure 8.1). The input x is a 40D vector, and the output f is a 10D vector that is passed through a softmax layer to produce class probabilities. We use a network with three hidden layers (figure 10.7). The fifteen channels of the first hidden layer $H_1$ are each computed using a kernel size of three and a stride of two with "valid" padding, giving nineteen spatial positions. The second hidden layer $H_2$ is also computed using a kernel size of three, a stride of two, and "valid" padding. The third hidden layer is computed similarly. At this stage, the representation has four spatial positions and fifteen channels. These values are reshaped into a vector of size sixty, which is mapped by a fully connected layer to the ten output activations.

This network was trained for 100,000 steps using SGD without momentum, a learning rate of 0.01, and a batch size of 100 on a dataset of 4,000 examples. [Problem 10.12] We compare this to a fully connected network with the same number of layers and hidden units (i.e., three hidden layers with 285, 135, and 60 hidden units, respectively). The convolutional network has 2,050 parameters, and the fully connected network has 59,065 parameters. By the logic of figure 10.4, the convolutional network is a special case of the fully connected
Figure 10.6 Receptive fields for a network with kernel width of three. a) An input with eleven dimensions feeds into a hidden layer with three channels and a convolution kernel of size three. The pre-activations of the three highlighted hidden units in the first hidden layer $H_1$ are different weighted sums of the nearest three inputs, so the receptive field in $H_1$ has size three. b) The pre-activations of the four highlighted hidden units in layer $H_2$ each take a weighted sum of the three channels in layer $H_1$ at each of the three nearest positions. Each hidden unit in layer $H_1$ weights the nearest three input positions. Hence, hidden units in $H_2$ have a receptive field size of five. c) The hidden units in the third layer (kernel size three, stride two) increase the receptive field size to seven. d) By the time we add a fourth layer, the receptive field of the hidden units at position three covers the entire input.
Figure 10.7 Convolutional network for classifying MNIST-1D data (see figure 8.1). The MNIST-1D input has dimension $D_i$ = 40. The first convolutional layer has fifteen channels, kernel size three, stride two, and only retains "valid" positions to make a representation with nineteen positions and fifteen channels. The following two convolutional layers have the same settings, gradually reducing the representation size. Finally, a fully connected layer takes all sixty hidden units from the third hidden layer. It outputs ten activations that are subsequently passed through a softmax layer to produce the ten class probabilities.
Figure 10.8 MNIST-1D results. a) The convolutional network from figure 10.7 eventually fits the training data perfectly and has 17% test error. b) A fully connected network with the same number of hidden layers and the same number of hidden units in each learns the training data faster but fails to generalize well, with 40% test error. The latter model could reproduce the convolutional model exactly but fails to do so. The convolutional structure restricts the possible mappings to those that process every position similarly, and this restriction improves performance.
one. The latter has enough flexibility to replicate the former exactly. Figure 10.8 shows that both models fit the training data perfectly. However, the test error for the convolutional network is much less than for the fully connected network. [Notebook 10.2: Convolution for MNIST-1D]
This discrepancy is probably not due to the difference in the number of parameters;
we know overparameterization usually improves performance (section 8.4.1). The likely
explanation is that the convolutional architecture has a superior inductive bias (i.e.,
interpolates between the training data better) because we have embodied some prior
knowledge in the architecture; we have forced the network to process each position in
the input in the same way. We know that the data were created by starting with a
template that is (among other operations) randomly translated, so this is sensible.
The fully connected network has to learn what each digit template looks like at every
position. In contrast, the convolutional network shares information across positions and
hence learns to identify each category more accurately. Another way of thinking about
this is that when we train the convolutional network, we search through a smaller family
of input/output mappings, all of which are plausible. Alternatively, the convolutional
structure can be considered a regularizer that applies an infinite penalty to most of the
solutions that a fully connected network can describe.
10.3 Convolutional networks for 2D inputs
The previous section described convolutional networks for processing 1D data. Such
networks can be applied to financial time series, audio, and text. However, convolutional
networks are more usually applied to 2D image data. The convolutional kernel is now
a 2D object. A 3×3 kernel in $\mathbb{R}^{3\times 3}$ applied to a 2D input comprising elements $x_{ij}$ computes a single layer of hidden units $h_{ij}$ as:

$$
h_{ij} = a\left[\beta + \sum_{m=1}^{3}\sum_{n=1}^{3} \omega_{mn}\, x_{i+m-2,\,j+n-2}\right], \qquad (10.6)
$$
where $\omega_{mn}$ are the entries of the convolutional kernel. This is simply a weighted sum over a square 3×3 input region. The kernel is translated both horizontally and vertically across the 2D input (figure 10.9) to create an output at each position. [Problem 10.13]
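Equation 10.6 can be sketched directly in NumPy (an illustrative helper, not the book's notebook code; zero-based indexing replaces the i+m−2, j+n−2 offsets, and zero padding handles the borders):

```python
import numpy as np

def conv2d_single_channel(x, omega, beta=0.0, a=lambda z: np.maximum(z, 0)):
    """Apply equation 10.6 at every position of a 2D input x.
    omega is a 3x3 kernel; zero padding keeps the output the same size."""
    I, J = x.shape
    x_pad = np.pad(x, 1)                 # zero padding (figure 10.9c-d)
    h = np.zeros((I, J))
    for i in range(I):
        for j in range(J):
            # weighted sum over the 3x3 region centered at (i, j), plus bias,
            # passed through the activation function a[.]
            h[i, j] = a(beta + np.sum(omega * x_pad[i:i + 3, j:j + 3]))
    return h

x = np.arange(16.0).reshape(4, 4)
omega = np.ones((3, 3)) / 9.0            # mean filter as an example kernel
h = conv2d_single_channel(x, omega)
print(h.shape)  # (4, 4): same size as the input
```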
Often the input is an RGB image, which is treated as a 2D signal with three channels (figure 10.10). Here, a 3×3 kernel would have 3×3×3 weights and be applied to the three input channels at each of the 3×3 positions to create a 2D output that is the same height and width as the input image (assuming zero padding). To generate multiple output channels, we repeat this process with different kernel weights and append the results to form a 3D tensor. If the kernel is size K×K, and there are C_i input channels, each output channel is a weighted sum of C_i×K×K quantities plus one bias. It follows that to compute C_o output channels, we need C_i×C_o×K×K weights and C_o biases. [Notebook 10.3: 2D convolution] [Problem 10.14] [Appendix B.3: Tensors]
Figure 10.9 2D convolutional layer. Each output h_{ij} computes a weighted sum of the 3×3 nearest inputs, adds a bias, and passes the result through an activation function. a) Here, the output h_{23} (shaded output) is a weighted sum of the nine positions from x_{12} to x_{34} (shaded inputs). b) Different outputs are computed by translating the kernel across the image grid in two dimensions. c–d) With zero padding, positions beyond the image's edge are considered to be zero.
10.4 Downsampling and upsampling
The network in gure 10.7 increased receptive eld size by scaling down the representa-
tion at each layer using stride two convolutions. We now consider methods for scaling
down or downsampling 2D input representations. We also describe methods for scaling
them back up (upsampling), which is useful when the output is also an image. Finally,
we consider methods to change the number of channels between layers. This is helpful
when recombining representations from two branches of a network (chapter 11).
10.4.1 Downsampling
There are three main approaches to scaling down a 2D representation. Here, we consider
the most common case of scaling down both dimensions by a factor of two. First, we
Figure 10.10 2D convolution applied to an image. The image is treated as a 2D input with three channels corresponding to the red, green, and blue components. With a 3×3 kernel, each pre-activation in the first hidden layer is computed by pointwise multiplying the 3×3×3 kernel weights with the 3×3 RGB image patch centered at the same position, summing, and adding the bias. To calculate all the pre-activations in the hidden layer, we "slide" the kernel over the image in both horizontal and vertical directions. The output is a 2D layer of hidden units. To create multiple output channels, we would repeat this process with multiple kernels, resulting in a 3D tensor of hidden units at hidden layer H_1.
can sample every other position. When we use a stride of two, we effectively apply this method simultaneously with the convolution operation (figure 10.11a). [Problem 10.15]
Second, max pooling retains the maximum of the 2×2 input values (figure 10.11b).
This induces some invariance to translation; if the input is shifted by one pixel, many
of these maximum values remain the same. Finally, mean pooling or average pooling
averages the inputs. For all approaches, we apply downsampling separately to each
channel, so the output has half the width and height but the same number of channels.
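These 2×2 downsampling operations can be sketched with NumPy reshaping (illustrative helpers; height and width are assumed to be even):

```python
import numpy as np

def subsample(x):       # figure 10.11a: retain every other position
    return x[::2, ::2]

def max_pool(x):        # figure 10.11b: maximum of each 2x2 block
    H, W = x.shape
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

def mean_pool(x):       # figure 10.11c: mean of each 2x2 block
    H, W = x.shape
    return x.reshape(H // 2, 2, W // 2, 2).mean(axis=(1, 3))

x = np.array([[ 1.,  2.,  3.,  4.],
              [ 5.,  6.,  7.,  8.],
              [ 9., 10., 11., 12.],
              [13., 14., 15., 16.]])
print(max_pool(x))    # [[ 6.  8.] [14. 16.]]
print(mean_pool(x))   # [[ 3.5  5.5] [11.5 13.5]]
```

Each function applies to a single channel; in a network, it would be repeated independently across every channel.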
10.4.2 Upsampling
The simplest way to scale up a network layer to double the resolution is to duplicate all the channels at each spatial position four times (figure 10.12a). A second method is max unpooling; this is used where we have previously used a max pooling operation for downsampling, and we distribute the values to the positions they originated from (figure 10.12b). A third approach uses bilinear interpolation to fill in the missing values between the points where we have samples (figure 10.12c).
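Max unpooling requires remembering where each maximum came from. A minimal NumPy sketch (illustrative helper names, not the book's notebook code):

```python
import numpy as np

def max_pool_with_indices(x):
    """2x2 max pooling that also records where each maximum came from."""
    H, W = x.shape
    # Gather each 2x2 block into a row of four values
    blocks = x.reshape(H // 2, 2, W // 2, 2).transpose(0, 2, 1, 3).reshape(-1, 4)
    idx = blocks.argmax(axis=1)        # position of the max within each block
    return blocks.max(axis=1).reshape(H // 2, W // 2), idx

def max_unpool(pooled, idx):
    """Scatter pooled values back to the positions the maxima came from."""
    h, w = pooled.shape
    blocks = np.zeros((h * w, 4))
    blocks[np.arange(h * w), idx] = pooled.ravel()
    return blocks.reshape(h, w, 2, 2).transpose(0, 2, 1, 3).reshape(h * 2, w * 2)

x = np.array([[1., 9., 2., 0.],
              [3., 4., 8., 5.],
              [6., 0., 1., 7.],
              [2., 5., 3., 4.]])
p, idx = max_pool_with_indices(x)
print(p)                   # [[9. 8.] [6. 7.]]
print(max_unpool(p, idx))  # zeros everywhere except where the maxima were
```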
[Notebook 10.4: Downsampling & upsampling]
A fourth approach is roughly analogous to downsampling using a stride of two. In that method, there were half as many outputs as inputs, and for kernel size three, each output was a weighted sum of the three closest inputs (figure 10.13a). In transposed convolution, this picture is reversed (figure 10.13c). There are twice as many outputs
Figure 10.11 Methods for scaling down representation size (downsampling). a) Sub-sampling. The original 4×4 representation (left) is reduced to size 2×2 (right) by retaining every other input. Colors on the left indicate which inputs contribute to the outputs on the right. This is effectively what happens with a kernel of stride two, except that the intermediate values are never computed. b) Max pooling. Each output comprises the maximum value of the corresponding 2×2 block. c) Mean pooling. Each output is the mean of the values in the 2×2 block.
Figure 10.12 Methods for scaling up representation size (upsampling). a) The
simplest way to double the size of a 2D layer is to duplicate each input four
times. b) In networks where we have previously used a max pooling operation
(figure 10.11b), we can redistribute the values to the same positions they originally
came from (i.e., where the maxima were). This is known as max unpooling. c) A
third option is bilinear interpolation between the input values.
Figure 10.13 Transposed convolution in 1D. a) Downsampling with kernel size
three, stride two, and zero padding. Each output is a weighted sum of three
inputs (arrows indicate weights). b) This can be expressed by a weight matrix
(same color indicates shared weight). c) In transposed convolution, each input
contributes three values to the output layer, which has twice as many outputs as
inputs. d) The associated weight matrix is the transpose of that in panel (b).
as inputs, and each input contributes to three of the outputs. When we consider the
associated weight matrix of this upsampling mechanism (figure 10.13d), we see that it is the transpose of the matrix for the downsampling mechanism (figure 10.13b).
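The transpose relationship in figure 10.13 can be checked numerically: build the weight matrix for a kernel-size-three, stride-two, zero-padded convolution and apply its transpose (an illustrative sketch with arbitrary kernel weights):

```python
import numpy as np

omega = np.array([1.0, 2.0, 3.0])    # arbitrary kernel weights omega_1..omega_3

def downsample_matrix(D_in, omega, stride=2):
    """Weight matrix for a kernel-size-3, stride-2 convolution with zero
    padding (figure 10.13b). Output j sums inputs 2j-1, 2j, and 2j+1."""
    D_out = D_in // stride
    W = np.zeros((D_out, D_in))
    for j in range(D_out):
        for m in range(3):
            i = stride * j + m - 1   # zero padding: skip i outside the input
            if 0 <= i < D_in:
                W[j, i] = omega[m]
    return W

W = downsample_matrix(6, omega)
x = np.arange(1.0, 7.0)              # input of length six
print(W @ x)          # downsampling: 3 outputs from 6 inputs
print(W.T @ (W @ x))  # transposed convolution: 6 outputs from 3 inputs
# Each of the 3 intermediate values contributes to three output positions;
# the upsampling weight matrix in figure 10.13d is exactly W.T.
```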
10.4.3 Changing the number of channels
Sometimes we want to change the number of channels between one hidden layer and the
next without further spatial pooling. This is usually so we can combine the representation
with another parallel computation (see chapter 11). To accomplish this, we apply a
convolution with kernel size one. Each element of the output layer is computed by taking a weighted sum of all the channels at the same position (figure 10.14). We can repeat this multiple times with different weights to generate as many output channels as we need. The associated convolution weights have size 1×1×C_i×C_o. Hence, this is known as 1×1 convolution. Combined with a bias and activation function, it is equivalent to running the same fully connected network on the channels at every position.
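Since a 1×1 convolution applies the same fully connected map at every position, it reduces to a matrix multiplication over the channel dimension. A minimal NumPy sketch (illustrative names):

```python
import numpy as np

def conv_1x1(x, weights, biases):
    """1x1 convolution: x has shape (H, W, C_i), weights (C_i, C_o),
    biases (C_o,). Every spatial position gets the same linear map."""
    return np.einsum('hwi,io->hwo', x, weights) + biases

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 5, 8))       # 5x5 representation, C_i = 8 channels
weights = rng.standard_normal((8, 16))   # the 1 x 1 x C_i x C_o weights
biases = np.zeros(16)
h = conv_1x1(x, weights, biases)
print(h.shape)  # (5, 5, 16): same spatial size, C_o = 16 channels

# Equivalent to running a fully connected layer at any single position:
assert np.allclose(h[2, 3], x[2, 3] @ weights + biases)
```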
10.5 Applications
We conclude by describing three computer vision applications. We describe convolutional networks for image classification, where the goal is to assign the image to one of a predetermined set of categories. Then we consider object detection, where the goal is to identify multiple objects in an image and find the bounding box around each. Finally, we describe an early system for semantic segmentation, where the goal is to assign a label to each pixel according to which object is present.
10.5.1 Image classication
Much of the pioneering work on deep learning in computer vision focused on image classification using the ImageNet dataset (figure 10.15). This contains 1,281,167 training images, 50,000 validation images, and 100,000 test images, and every image is labeled as belonging to one of 1000 possible categories.

Most methods reshape the input images to a standard size; in a typical system, the input x to the network is a 224×224 RGB image, and the output is a probability distribution over the 1000 classes. The task is challenging; there are a large number of classes, and they exhibit considerable variation (figure 10.15). In 2011, before deep networks were applied, the state-of-the-art method had a 25% error rate for the correct class being within the top five suggestions. Five years later, the best deep learning models eclipsed human performance.

In 2012, AlexNet was the first convolutional network to perform well on this task. It consists of eight hidden layers with ReLU activation functions, of which the first five are convolutional and the rest fully connected (figure 10.16). The network starts by
Figure 10.14 1×1 convolution. To change the number of channels without spatial
pooling, we apply a 1×1 kernel. Each output channel is computed by taking
a weighted sum of all of the channels at the same position, adding a bias, and
passing through an activation function. Multiple output channels are created by
repeating this operation with different weights and biases.
Figure 10.15 Example ImageNet classification images. The model aims to assign an input image to one of 1000 classes. This task is challenging because the images vary widely along different attributes (columns). These include rigidity (monkey < canoe), number of instances in image (lizard < strawberry), clutter (compass < steel drum), size (candle < spiderweb), texture (screwdriver < leopard), distinctiveness of color (mug < red wine), and distinctiveness of shape (headland < bell). Adapted from Russakovsky et al. (2015).
Figure 10.16 AlexNet (Krizhevsky et al., 2012). The network maps a 224×224 color image to a 1000-dimensional vector representing class probabilities. The network first convolves with 11×11 kernels and stride 4 to create 96 channels. It decreases the resolution again using a max pool operation and applies a 5×5 convolutional layer. Another max pooling layer follows, and three 3×3 convolutional layers are applied. After a final max pooling operation, the result is vectorized and passed through three fully connected (FC) layers and finally the softmax layer.
downsampling the input using an 11×11 kernel with a stride of four to create 96 channels. It then downsamples again using a max pooling layer before applying a 5×5 kernel to create 256 channels. There are three more convolutional layers with kernel size 3×3, eventually resulting in a 13×13 representation with 256 channels. A final max-pooling layer yields a 6×6 representation with 256 channels, which is resized into a vector of length 9,216 and passed through three fully connected layers containing 4096, 4096, and 1000 hidden units, respectively. The last layer is passed through the softmax function to output a probability distribution over the 1000 classes. The complete network contains 60 million parameters, most of which are in the fully connected layers. [Problems 10.16–10.17]
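The claim that most parameters lie in the fully connected layers can be verified by tallying just those layers (layer sizes from the text; the snippet is illustrative):

```python
# Fully connected layers of AlexNet: 9216 -> 4096 -> 4096 -> 1000,
# where 9216 = 6 x 6 x 256 from the final max pooling layer.
sizes = [9216, 4096, 4096, 1000]
fc_params = sum(d_in * d_out + d_out          # weights plus biases
                for d_in, d_out in zip(sizes[:-1], sizes[1:]))
print(fc_params)  # 58631144 of the roughly 60 million total parameters
```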
The dataset size was augmented by a factor of 2048 using (i) spatial transformations and (ii) modifications of the input intensities. At test time, five different cropped and mirrored versions of the image were run through the network, and their predictions averaged. The system was learned using SGD with a momentum coefficient of 0.9 and a batch size of 128. Dropout was applied in the fully connected layers, and an L2 (weight decay) regularizer was used. This system achieved a 16.4% top-5 error rate and a 38.1% top-1 error rate. At the time, this was an enormous leap forward in performance at a task considered far beyond the capabilities of contemporary methods. This result revealed the potential of deep learning and kick-started the modern era of AI research. [Notebook 10.5: Convolution for MNIST]
The VGG network was also targeted at classification in the ImageNet task and achieved a considerably better performance of 6.8% top-5 error rate and a 23.7% top-1 error rate. This network is similarly composed of a series of interspersed convolutional and max pooling layers, where the spatial size of the representation gradually decreases, but the number of channels increases. These are followed by three fully connected layers (figure 10.17). The VGG network was also trained using data augmentation, weight decay, and dropout.

Although there were various minor differences in the training regime, the most important change between AlexNet and VGG was the depth of the network. The latter used 19 hidden layers and 144 million parameters. The networks in figures 10.16 and 10.17 are depicted at the same scale for comparison. There was a general trend for several years for performance on this task to improve as the depth of the networks increased, and this is evidence that depth is important in neural networks. [Problem 10.18]
Figure 10.17 VGG network (Simonyan & Zisserman, 2014) depicted at the same scale as AlexNet (see figure 10.16). This network consists of a series of convolutional layers and max pooling operations, in which the spatial scale of the representation gradually decreases, but the number of channels gradually increases. The hidden layer after the last convolutional operation is resized to a 1D vector, and three fully connected layers follow. The network outputs 1000 activations corresponding to the class labels that are passed through a softmax function to create class probabilities.
10.5.2 Object detection
In object detection, the goal is to identify and localize multiple objects within the image.
An early method based on convolutional networks was You Only Look Once, or YOLO
for short. The input to the YOLO network is a 448×448 RGB image. This is passed
through 24 convolutional layers that gradually decrease the representation size using
max pooling operations while concurrently increasing the number of channels, similarly
to the VGG network. The nal convolutional layer is of size 7 ×7 and has 1024 channels.
This is reshaped to a vector, and a fully connected layer maps it to 4096 values. One
further fully connected layer maps this representation to the output.
The output values encode which class is present at each of a 7×7 grid of locations (figure 10.18a–b). For each location, the output values also encode a fixed number of bounding boxes. Five parameters define each box: the x- and y-positions of the center, the height and width of the box, and the confidence of the prediction (figure 10.18c). The confidence estimates the overlap between the predicted and ground truth bounding boxes. The system is trained using momentum, weight decay, dropout, and data augmentation. Transfer learning is employed; the network is initially trained on the ImageNet classification task and is then fine-tuned for object detection.

After the network is run, a heuristic process is used to remove rectangles with low confidence and to suppress predicted bounding boxes that correspond to the same object so only the most confident one is retained.
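The size of the final output vector follows from this encoding. A quick tally, assuming the 20 object classes of the original PASCAL VOC setup (an assumption — the class count is not stated above) and two boxes per grid cell as in figure 10.18:

```python
grid = 7            # 7x7 grid of locations
n_classes = 20      # PASCAL VOC classes (assumed; not stated in the text)
boxes_per_cell = 2  # as in figure 10.18c
box_params = 5      # x, y, width, height, and confidence

outputs = grid * grid * (n_classes + boxes_per_cell * box_params)
print(outputs)      # 1470 output values from the final fully connected layer
```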
Figure 10.18 YOLO object detection. a) The input image is reshaped to 448×448 and divided into a regular 7×7 grid. b) The system predicts the most likely class at each grid cell. c) It also predicts two bounding boxes per cell, and a confidence value (represented by thickness of line). d) During inference, the most likely bounding boxes are retained, and boxes with lower confidence values that belong to the same object are suppressed. Adapted from Redmon et al. (2016).
10.5.3 Semantic segmentation
The goal of semantic segmentation is to assign a label to each pixel according to the object that it belongs to, or no label if that pixel does not correspond to anything in the training database. An early network for semantic segmentation is depicted in figure 10.19. The input is a 224×224 RGB image, and the output is a 224×224×21 array that contains the probability of each of 21 possible classes at each position.

The first part of the network is a smaller version of VGG (figure 10.17) that contains thirteen rather than sixteen convolutional layers and downsizes the representation to size 14×14. There is then one more max pooling operation, followed by two fully connected layers that map to two 1D representations of size 4096. These layers do not represent spatial position but instead combine information from across the whole image.

Here, the architecture diverges from VGG. Another fully connected layer reconstitutes the representation into 7×7 spatial positions and 512 channels. This is followed
Figure 10.19 Semantic segmentation network of Noh et al. (2015). The input is a 224×224 image, which is passed through a version of the VGG network and eventually transformed into a representation of size 4096 using a fully connected layer. This contains information about the entire image. This is then reformed into a representation of size 7×7 using another fully connected layer, and the image is upsampled and deconvolved (transposed convolutions without upsampling) in a mirror image of the VGG network. The output is a 224×224×21 representation that gives the output probabilities for the 21 classes at each position.
by a series of max unpooling layers (see figure 10.12b) and deconvolution layers. These are transposed convolutions (see figure 10.13) but in 2D and without the upsampling. Finally, there is a 1×1 convolution to create 21 channels representing the possible classes and a softmax operation at each spatial position to map the activations to class probabilities. The downsampling side of the network is sometimes referred to as an encoder, and the upsampling side as a decoder, so networks of this type are sometimes called encoder-decoder networks or hourglass networks due to their shape.

The final segmentation is generated using a heuristic method that greedily searches for the class that is most represented and infers its region, taking into account the probabilities but also encouraging connectedness. Then the next most-represented class is added where it dominates at the remaining unlabeled pixels. This continues until there is insufficient evidence to add more (figure 10.20).
10.6 Summary
In convolutional layers, each hidden unit is computed by taking a weighted sum of the
nearby inputs, adding a bias, and applying an activation function. The weights and
the bias are the same at every spatial position, so there are far fewer parameters than
in a fully connected network, and the number of parameters doesn’t increase with the
input image size. To ensure that information is not lost, this operation is repeated with
Figure 10.20 Semantic segmentation results. The final result is created from the 21 probability maps by greedily selecting the best class and using a heuristic method to find a sensible binary map based on the probabilities and their spatial proximity. If there is enough evidence, subsequent classes are added, and their segmentation maps are combined. Adapted from Noh et al. (2015).
dierent weights and biases to create multiple channels at each spatial position.
Typical convolutional networks consist of convolutional layers interspersed with layers
that downsample by a factor of two. As the network progresses, the spatial dimensions
usually decrease by factors of two, and the number of channels increases by factors of
two. At the end of the network, there are typically one or more fully connected layers
that integrate information from across the entire input and create the desired output. If
the output is an image, a mirrored “decoder” upsamples back to the original size.
The translational equivariance of convolutional layers imposes a useful inductive bias
that increases performance for image-based tasks relative to fully connected networks.
We described image classication, object detection, and semantic segmentation networks.
Image classication performance was shown to improve as the network became deeper.
However, subsequent experiments showed that increasing the network depth indefinitely doesn't continue to help; after a certain depth, the system becomes difficult to train.
This is the motivation for residual connections, which are the topic of the next chapter.
Notes
Dumoulin & Visin (2016) present an overview of the mathematics of convolutions that expands
on the brief treatment in this chapter.
Convolutional networks: Early convolutional networks were developed by Fukushima &
Miyake (1982), LeCun et al. (1989a), and LeCun et al. (1989b). Initial applications included
handwriting recognition (LeCun et al., 1989a; Martin, 1993), face recognition (Lawrence et al.,
1997), phoneme recognition (Waibel et al., 1989), spoken word recognition (Bottou et al., 1990),
and signature verication (Bromley et al., 1993). However, convolutional networks were popu-
larized by LeCun et al. (1998), who built a system called LeNet for classifying 28×28 grayscale
images of handwritten digits. This is immediately recognizable as a precursor of modern net-
works; it uses a series of convolutional layers, followed by fully connected layers, sigmoid activa-
tions rather than ReLUs, and average pooling rather than max pooling. AlexNet (Krizhevsky
et al., 2012) is widely considered the starting point for modern deep convolutional networks.
ImageNet Challenge: Deng et al. (2009) collated the ImageNet database, and the associated classification challenge drove progress in deep learning for several years after AlexNet. Notable subsequent winners of this challenge include the network-in-network architecture (Lin et al., 2014), which alternated convolutions with fully connected layers that operated independently on all of the channels at each position (i.e., 1×1 convolutions). Zeiler & Fergus (2014) and Simonyan & Zisserman (2014) trained larger and deeper architectures that were fundamentally similar to AlexNet. Szegedy et al. (2017) developed an architecture called GoogLeNet, which introduced inception blocks. These use several parallel paths with different filter sizes, which are then recombined. This effectively allowed the system to learn the filter size.

The trend was for performance to improve with increasing depth. However, it ultimately became difficult to train deeper networks without modifications; these include residual connections and normalization layers, both of which are described in the next chapter. Progress in the ImageNet challenges is summarized in Russakovsky et al. (2015). A more general survey of image classification using convolutional networks can be found in Rawat & Wang (2017). The improvement of image classification networks over time is visualized in figure 10.21.
Types of convolutional layers: Atrous or dilated convolutions were introduced by Chen et al. (2018c) and Yu & Koltun (2015). Transposed convolutions were introduced by Long et al. (2015). Odena et al. (2016) pointed out that they can lead to checkerboard artifacts and should be used with caution. Lin et al. (2014) is an early example of convolution with 1×1 filters.
Many variants of the standard convolutional layer aim to reduce the number of parameters. These include depthwise or channel-separate convolution (Howard et al., 2017; Tran et al., 2018), in which a different filter convolves each channel separately to create a new set of channels. For a kernel size of K×K with C input channels and C output channels, this requires K×K×C parameters rather than the K×K×C×C parameters in a regular convolutional layer. A related approach is grouped convolutions (Xie et al., 2017), where each convolution kernel is only applied to a subset of the channels with a commensurate reduction in the parameters. In fact, grouped convolutions were used in AlexNet for computational reasons; the whole network could not run on a single GPU, so some channels were processed on one GPU and some on another, with limited interaction points. Separable convolutions treat each kernel as an outer product of 1D vectors; they use C + K + K parameters for each of the C channels. Partial convolutions (Liu et al., 2018a) are used when inpainting missing pixels and account for the partial masking of the input. Gated convolutions learn the mask from the previous layer (Yu et al., 2019; Chang et al., 2019b). Hu et al. (2018b) propose squeeze-and-excitation networks which re-weight the channels using information pooled across all spatial positions.
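These parameter counts can be made concrete with example values (K = 3 and C = 256 chosen arbitrarily; the separable count follows the C + K + K figure quoted above):

```python
K, C = 3, 256   # kernel size and channel count (example values)

regular   = K * K * C * C        # standard convolution
depthwise = K * K * C            # one K x K filter per channel
separable = (C + K + K) * C      # outer-product kernels, per channel

print(regular, depthwise, separable)  # 589824 2304 67072
print(regular // depthwise)           # depthwise uses C = 256 times fewer weights
```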
Downsampling and upsampling: Average pooling dates back to at least LeCun et al. (1989a)
and max pooling to Zhou & Chellappa (1988). Scherer et al. (2010) compared these methods
and concluded that max pooling was superior. The max unpooling method was introduced by
Zeiler et al. (2011) and Zeiler & Fergus (2014). Max pooling can be thought of as applying
Figure 10.21 ImageNet performance. Each circle represents a different published model. Blue circles represent models that were state-of-the-art. Models discussed in this book are also highlighted. The AlexNet and VGG networks were remarkable for their time but are now far from state of the art. ResNet-200 and DenseNet are discussed in chapter 11. ImageGPT, ViT, SWIN, and DaViT are discussed in chapter 12. Adapted from https://paperswithcode.com/sota/image-classification-on-imagenet.
an $L_\infty$ norm to the hidden units that are to be pooled. This led to applying other $L_k$ norms (Springenberg et al., 2015; Sainath et al., 2013), although these require more computation and are not widely used. Zhang (2019) introduced max-blur-pooling, in which a low-pass filter is applied before downsampling to prevent aliasing, and showed that this improves generalization over translation of the inputs and protects against adversarial attacks (see section 20.4.6). [Appendix B.3.2: Vector norms]
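A minimal 1D sketch of the max-blur-pooling idea (an illustrative simplification, not Zhang's implementation): compute a dense stride-one max, low-pass filter with a binomial kernel, then subsample:

```python
import numpy as np

def max_blur_pool_1d(x):
    """Max-blur-pooling sketch: dense (stride-1) max over adjacent pairs,
    then a [1, 2, 1]/4 low-pass (anti-aliasing) filter, then subsample."""
    dense_max = np.maximum(x[:-1], x[1:])       # stride-1 max pooling
    blurred = np.convolve(dense_max, np.array([1., 2., 1.]) / 4., mode='same')
    return blurred[::2]                         # keep every other value

x = np.array([0., 1., 0., 0., 1., 0., 0., 1.])
print(max_blur_pool_1d(x))
```

Without the blur, a one-pixel shift of the input can change the subsampled output abruptly; the low-pass filter smooths this, which is the anti-aliasing argument above.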
Shi et al. (2016) introduced PixelShue, which used convolutional lters with a stride of 1/s
to scale up 1D signals by a factor of s. Only the weights that lie exactly on positions are
used to create the outputs, and the ones that fall between positions are discarded. This can
be implemented by multiplying the number of channels in the kernel by a factor of s, where
the s
th
output position is computed from just the s
th
subset of channels. This can be trivially
extended to 2D convolution, which requires s
2
channels.
Convolution in 1D and 3D: Convolutional networks are usually applied to images but have
also been applied to 1D data in applications that include speech recognition (Abdel-Hamid
et al., 2012), sentence classification (Zhang et al., 2015; Conneau et al., 2017), electrocardiogram classification (Kiranyaz et al., 2015), and bearing fault diagnosis (Eren et al., 2019). A survey
of 1D convolutional networks can be found in Kiranyaz et al. (2021). Convolutional networks
have also been applied to 3D data, including video (Ji et al., 2012; Saha et al., 2016; Tran et al.,
2015) and volumetric measurements (Wu et al., 2015b; Maturana & Scherer, 2015).
Invariance and equivariance: Part of the motivation for convolutional layers is that they
are approximately equivariant with respect to translation, and part of the motivation for max
pooling is to induce invariance to small translations. Zhang (2019) considers the degree to which convolutional networks really have these properties and proposes the max-blur-pooling modification that demonstrably improves them. There is considerable interest in making networks equivariant or invariant to other types of transformations, such as reflections, rotations, and scaling. Sifre & Mallat (2013) constructed a system based on wavelets that induced both translational and rotational invariance in image patches and applied this to texture classification. Kanazawa et al. (2014) developed locally scale-invariant convolutional neural networks. Cohen & Welling (2016) exploited group theory to construct group CNNs, which are equivariant to larger families of transformations, including reflections and rotations. Esteves et al. (2018) introduced polar transformer networks, which are invariant to translations and equivariant to rotation and scale. Worrall et al. (2017) developed harmonic networks, the first example of a group CNN that was equivariant to continuous rotations.
Initialization and regularization: Convolutional networks are typically initialized using Xavier initialization (Glorot & Bengio, 2010) or He initialization (He et al., 2015), as described in section 7.5. However, the ConvolutionOrthogonal initializer (Xiao et al., 2018a) is specialized for convolutional networks. Networks of up to 10,000 layers can be trained using this initialization without the need for residual connections. [Problem 10.19]
Dropout is eective for fully connected networks but less so for convolutional layers (Park &
Kwak, 2016). This may be because neighboring image pixels are highly correlated, so if a hidden
unit drops out, the same information is passed on via adjacent positions. This is the motivation
for spatial dropout and cutout. In spatial dropout (Tompson et al., 2015), entire feature maps
are discarded instead of individual pixels. This circumvents the problem of neighboring pixels
carrying the same information. Similarly, DeVries & Taylor (2017b) propose cutout, in which a
square patch of each input image is masked at training time. Wu & Gu (2015) modied max
pooling for dropout layers using a method that involves sampling from a probability distribution
over the constituent elements rather than always taking the maximum.
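As a rough sketch (not the implementations from the cited papers), both spatial dropout and cutout can be written in a few lines of NumPy; the function names and channel-first layout here are our own choices:

```python
import numpy as np

def spatial_dropout(x, p, rng):
    """Zero out entire feature maps with probability p.

    x has shape (batch, channels, height, width); a whole channel is dropped
    per example, so correlated neighboring pixels cannot leak the discarded
    information. Kept channels are rescaled, as in standard dropout.
    """
    keep = rng.random((x.shape[0], x.shape[1], 1, 1)) >= p
    return x * keep / (1.0 - p)

def cutout(x, size, rng):
    """Mask a random square patch of each input image at training time."""
    x = x.copy()
    batch, _, height, width = x.shape
    for i in range(batch):
        top = rng.integers(0, height - size + 1)
        left = rng.integers(0, width - size + 1)
        x[i, :, top:top + size, left:left + size] = 0.0
    return x
```
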
Adaptive Kernels: The inception block (Szegedy et al., 2017) applies convolutional filters of
different sizes in parallel and, as such, provides a crude mechanism by which the network can
learn the appropriate filter size. Other work has investigated learning the scale of convolutions
as part of the training process (e.g., Pintea et al., 2021; Romero et al., 2021) or the stride of
downsampling layers (Riad et al., 2022).
In some systems, the kernel size is changed adaptively based on the data. This is sometimes in
the context of guided convolution, where one input is used to help guide the computation from
another input. For example, an RGB image might be used to help upsample a low-resolution
depth map. Jia et al. (2016) directly predicted the filter weights themselves using a different
network branch. Xiong et al. (2020b) change the kernel size adaptively. Su et al. (2019a)
moderate weights of fixed kernels by a function learned from another modality. Dai et al.
(2017) learn offsets of weights so that they do not have to be applied in a regular grid.
Object detection and semantic segmentation: Object detection methods can be divided
into proposal-based and proposal-free schemes. In the former case, processing occurs in two
stages. A convolutional network ingests the whole image and proposes regions that might
contain objects. These proposal regions are then resized, and a second network analyzes them
to establish whether there is an object there and what it is. An early example of this approach
was R-CNN (Girshick et al., 2014). This was subsequently extended to allow end-to-end training
(Girshick, 2015) and to reduce the cost of the region proposals (Ren et al., 2015). Subsequent
work on feature pyramid networks improved both performance and speed by combining features
across multiple scales (Lin et al., 2017b). In contrast, proposal-free schemes perform all the
processing in a single pass. YOLO (Redmon et al., 2016), which was described in section 10.5.2,
is the most celebrated example of a proposal-free scheme. The most recent iteration of this
framework at the time of writing is YOLOv7 (Wang et al., 2022a). A recent review of object
detection can be found in Zou et al. (2023).
The semantic segmentation network described in section 10.5.3 was developed by Noh et al.
(2015). Many subsequent approaches have been variations of U-Net (Ronneberger et al., 2015),
which is described in section 11.5.3. Recent surveys of semantic segmentation can be found in
Minaee et al. (2021) and Ulku & Akagündüz (2022).
Visualizing Convolutional Networks: The dramatic success of convolutional networks led
to a series of efforts to visualize the information they extract from the image (see Qin et al., 2018,
for a review). Erhan et al. (2009) visualized the optimal stimulus that activated a hidden unit
by starting with an image containing noise and then optimizing the input to make the hidden
unit most active using gradient ascent. Zeiler & Fergus (2014) trained a network to reconstruct
the input and then set all the hidden units to zero except the one they were interested in;
the reconstruction then provides information about what drives the hidden unit. Mahendran
& Vedaldi (2015) visualized an entire layer of a network. Their network inversion technique
aimed to nd an image that resulted in the activations at that layer but also incorporates prior
knowledge that encourages this image to have similar statistics to natural images.
Finally, Bau et al. (2017) introduced network dissection. Here, a series of images with known
pixel labels capturing color, texture, and object type are passed through the network, and the
correlation of a hidden unit with each property is measured. This method has the advantage
that it only uses the forward pass of the network and does not require optimization. These
methods did provide some partial insight into how the network processes images. For example,
Bau et al. (2017) showed that earlier layers correlate more with texture and color and later
layers with the object type. However, it is fair to say that fully understanding the processing
of networks containing millions of parameters is currently not possible.
Problems
Problem 10.1 Show that the operation in equation 10.3 is equivariant with respect to transla-
tion.
Problem 10.2 Equation 10.3 defines 1D convolution with a kernel size of three, stride of one,
and dilation one. Write out the equivalent equation for the 1D convolution with a kernel size
of three and a stride of two as pictured in figure 10.3a–b.
Problem 10.3 Write out the equation for the 1D dilated convolution with a kernel size of three
and a dilation rate of two, as pictured in figure 10.3d.
Problem 10.4 Write out the equation for a 1D convolution with kernel size of seven, a dilation
rate of three, and a stride of three. You may assume that the input is padded with zeros at
positions x_{-2}, x_{-1}, and x_0.
Problem 10.5 Draw weight matrices in the style of figure 10.4d for (i) the strided convolution
in figure 10.3a–b, (ii) the convolution with kernel size 5 in figure 10.3c, and (iii) the dilated
convolution in figure 10.3d.
Problem 10.6 Draw a 12×6 weight matrix in the style of figure 10.4d relating the inputs
x_1, . . . , x_6 to the outputs h_1, . . . , h_12 in the multi-channel convolution as depicted in figures 10.5a–b.
Problem 10.7 Draw a 6×12 weight matrix in the style of figure 10.4d relating the inputs
h_1, . . . , h_12 to the outputs h′_1, . . . , h′_6 in the multi-channel convolution in figure 10.5c.
Problem 10.8 Consider a 1D convolutional network where the input has three channels. The
first hidden layer is computed using a kernel size of three and has four channels. The second
hidden layer is computed using a kernel size of five and has ten channels. How many biases and
how many weights are needed for each of these two convolutional layers?
Problem 10.9 A network consists of three 1D convolutional layers. At each layer, a zero-padded
convolution with kernel size three, stride one, and dilation one is applied. What size is the
receptive field of the hidden units in the third layer?
Problem 10.10 A network consists of three 1D convolutional layers. At each layer, a zero-
padded convolution with kernel size seven, stride one, and dilation one is applied. What size is
the receptive field of hidden units in the third layer?
Problem 10.11 Consider a convolutional network with 1D input x. The first hidden layer H_1 is
computed using a convolution with kernel size five, stride two, and a dilation rate of one. The
second hidden layer H_2 is computed using a convolution with kernel size three, stride one, and
a dilation rate of one. The third hidden layer H_3 is computed using a convolution with kernel
size five, stride one, and a dilation rate of two. What are the receptive field sizes at each hidden
layer?
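Problems 10.9–10.11 can all be checked with the standard receptive-field recurrence. This helper (our own naming, a sketch rather than anything from the text) tracks the field size and the input-step between adjacent hidden units:

```python
def receptive_field(layers):
    """Receptive field of the last layer's units on the input.

    `layers` is a list of (kernel, stride, dilation) tuples, one per
    convolutional layer, applied in order.
    """
    rf, jump = 1, 1   # field size, and input-step between adjacent units
    for kernel, stride, dilation in layers:
        rf += (kernel - 1) * dilation * jump
        jump *= stride
    return rf

# Three layers with kernel size three, stride one, dilation one:
print(receptive_field([(3, 1, 1)] * 3))   # receptive field of 7
```
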
Problem 10.12 The 1D convolutional network in figure 10.7 was trained using stochastic gradient
descent with a learning rate of 0.01 and a batch size of 100 on a training dataset of 4,000 examples
for 100,000 steps. How many epochs was the network trained for?
Problem 10.13 Draw a weight matrix in the style of figure 10.4d that shows the relationship
between the 24 inputs and the 24 outputs in figure 10.9.
Problem 10.14 Consider a 2D convolutional layer with kernel size 5×5 that takes 3 input
channels and returns 10 output channels. How many convolutional weights are there? How
many biases?
Problem 10.15 Draw a weight matrix in the style of figure 10.4d that samples every other
variable in a 1D input (i.e., the 1D analog of figure 10.11a). Show that the weight matrix for
1D convolution with kernel size three and stride two is equivalent to composing the matrices for
1D convolution with kernel size three, stride one, and this sampling matrix.
Problem 10.16 Consider the AlexNet network (figure 10.16). How many parameters are used
in each convolutional and fully connected layer? What is the total number of parameters?
Problem 10.17 What is the receptive field size at each of the first three layers of AlexNet
(figure 10.16)?
Problem 10.18 How many weights and biases are there at each convolutional layer and fully
connected layer in the VGG architecture (figure 10.17)?
Problem 10.19 Consider two hidden layers of size 224×224 with C_1 and C_2 channels, respec-
tively, connected by a 3×3 convolutional layer. Describe how to initialize the weights using He
initialization.
Chapter 11
Residual networks
The previous chapter described how image classification performance improved as the
depth of convolutional networks was extended from eight layers (AlexNet) to eighteen
layers (VGG). This led to experimentation with even deeper networks. However, per-
formance decreased again when many more layers were added.
This chapter introduces residual blocks. Here, each network layer computes an addi-
tive change to the current representation instead of transforming it directly. This allows
deeper networks to be trained but causes an exponential increase in the activation mag-
nitudes at initialization. Residual blocks employ batch normalization to compensate for
this, which re-centers and rescales the activations at each layer.
Residual blocks with batch normalization allow much deeper networks to be trained,
and these networks improve performance across a variety of tasks. Architectures that
combine residual blocks to tackle image classification, medical image segmentation,
human pose estimation are described.
11.1 Sequential processing
Every network we have seen so far processes the data sequentially; each layer receives
the previous layer’s output and passes the result to the next (figure 11.1). For example,
a three-layer network is defined by:

h_1 = f_1[x, ϕ_1]
h_2 = f_2[h_1, ϕ_2]
h_3 = f_3[h_2, ϕ_3]
y = f_4[h_3, ϕ_4],     (11.1)
where h_1, h_2, and h_3 denote the intermediate hidden layers, x is the network input, y
is the output, and the functions f_k[•, ϕ_k] perform the processing.
In a standard neural network, each layer consists of a linear transformation followed
by an activation function, and the parameters ϕ_k comprise the weights and biases of the
Figure 11.1 Sequential processing. Standard neural networks pass the output of
each layer directly into the next layer.
linear transformation. In a convolutional network, each layer consists of a set of convolu-
tions followed by an activation function, and the parameters comprise the convolutional
kernels and biases.
Since the processing is sequential, we can equivalently think of this network as a
series of nested functions:

y = f_4[ f_3[ f_2[ f_1[x, ϕ_1], ϕ_2 ], ϕ_3 ], ϕ_4 ].     (11.2)
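A minimal numerical check that the sequential form (equation 11.1) and the nested form (equation 11.2) agree, sketched with random parameters and a ReLU layer standing in for every f_k:

```python
import numpy as np

def layer(x, W, b):
    """One standard layer: a linear transformation followed by a ReLU."""
    return np.maximum(0.0, W @ x + b)

rng = np.random.default_rng(1)
params = [(rng.standard_normal((4, 4)), rng.standard_normal(4)) for _ in range(4)]
x = rng.standard_normal(4)

# Sequential form (equation 11.1): each layer feeds the next.
h = x
for W, b in params:
    h = layer(h, W, b)

# Nested form (equation 11.2): one composite function.
y = layer(layer(layer(layer(x, *params[0]), *params[1]), *params[2]), *params[3])
assert np.allclose(h, y)
```
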
11.1.1 Limitations of sequential processing
In principle, we can add as many layers as we want, and in the previous chapter, we saw
that adding more layers to a convolutional network does improve performance; the VGG
network (figure 10.17), which has eighteen layers, outperforms AlexNet (figure 10.16),
which has eight layers. However, image classification performance decreases again as
further layers are added (figure 11.2). This is surprising since models generally perform
better as more capacity is added (figure 8.10). Indeed, the decrease is present for both the
training set and the test set, which implies that the problem is training deeper networks
rather than the inability of deeper networks to generalize.
This phenomenon is not completely understood. One conjecture is that at initial-
ization, the loss gradients change unpredictably when we modify parameters in early
network layers. With appropriate initialization of the weights (see section 7.5), the gra-
dient of the loss with respect to these parameters will be reasonable (i.e., no exploding
or vanishing gradients). However, the derivative assumes an infinitesimal change in the
parameter, whereas optimization algorithms use a finite step size. Any reasonable choice
of step size may move to a place with a completely different and unrelated gradient; the
loss surface looks like an enormous range of tiny mountains rather than a single smooth
structure that is easy to descend. Consequently, the algorithm doesn’t make progress in
the way that it does when the loss function gradient changes more slowly.
Notebook 11.1 Shattered gradients
This conjecture is supported by empirical observations of gradients in networks with
a single input and output. For a shallow network, the gradient of the output with re-
spect to the input changes slowly as we change the input (figure 11.3a). However, for a
deep network, a tiny change in the input results in a completely different gradient (fig-
ure 11.3b). This is captured by the autocorrelation function of the gradient (figure 11.3c).
Nearby gradients are correlated for shallow networks, but this correlation quickly drops
to zero for deep networks. This is termed the shattered gradients phenomenon.
Appendix B.2.1 Autocorrelation function
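The experiment can be sketched in NumPy. This is a loose reconstruction rather than the exact setup of figure 11.3: we use He-scaled weights with small random biases so that the activations neither vanish nor become input-independent, and we measure the lag-one autocorrelation of ∂y/∂x over a grid of inputs:

```python
import numpy as np

def make_net(depth, width, rng):
    """Random scalar-in, scalar-out ReLU network with He-scaled weights."""
    dims = [1] + [width] * depth + [1]
    weights = [rng.standard_normal((dims[i + 1], dims[i])) * np.sqrt(2.0 / dims[i])
               for i in range(depth + 1)]
    biases = [0.5 * rng.standard_normal(d) for d in dims[1:-1]]
    return weights, biases

def dydx(x, weights, biases):
    """Gradient of the output with respect to the input, via backprop."""
    h = np.array([x])
    jac = np.eye(1)
    for W, b in zip(weights[:-1], biases):
        pre = W @ h + b
        jac = (W * (pre > 0)[:, None]) @ jac   # chain rule through linear + ReLU
        h = np.maximum(0.0, pre)
    return (weights[-1] @ jac).item()          # final layer is linear

xs = np.linspace(-2.0, 2.0, 200)
lag1 = {}
for depth in (1, 24):
    w, b = make_net(depth, 200, np.random.default_rng(0))
    g = np.array([dydx(x, w, b) for x in xs])
    g = (g - g.mean()) / g.std()
    lag1[depth] = float(np.mean(g[:-1] * g[1:]))   # autocorrelation at lag one
print(lag1)   # nearby gradients decorrelate far more for the deep network
```
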
Figure 11.2 Decrease in performance when adding more convolutional layers. a) A
20-layer convolutional network outperforms a 56-layer neural network for image
classication on the test set of the CIFAR-10 dataset (Krizhevsky & Hinton,
2009). b) This is also true for the training set, which suggests that the problem
relates to training the original network rather than a failure to generalize to new
data. Adapted from He et al. (2016a).
Figure 11.3 Shattered gradients. a) Consider a shallow network with 200 hidden
units and Glorot initialization (He initialization without the factor of two) for
both the weights and biases. The gradient ∂y/∂x of the scalar network output y
with respect to the scalar input x changes relatively slowly as we change the in-
put x. b) For a deep network with 24 layers and 200 hidden units per layer, this
gradient changes very quickly and unpredictably. c) The autocorrelation function
of the gradient shows that nearby gradients become unrelated (have autocorrela-
tion close to zero) for deep networks. This shattered gradients phenomenon may
explain why it is hard to train deep networks. Gradient descent algorithms rely
on the loss surface being relatively smooth, so the gradients should be related
before and after each update step. Adapted from Balduzzi et al. (2017).
Shattered gradients presumably arise because changes in early network layers modify
the output in an increasingly complex way as the network becomes deeper. The derivative
of the output y with respect to the first layer f_1 of the network in equation 11.1 is:
Appendix B.5 Matrix calculus

∂y/∂f_1 = (∂f_2/∂f_1)(∂f_3/∂f_2)(∂f_4/∂f_3).     (11.3)

When we change the parameters that determine f_1, all of the derivatives in this sequence
can change since layers f_2, f_3, and f_4 are themselves computed from f_1. Consequently,
the updated gradient at each training example may be completely different, and the loss
function becomes badly behaved.¹
11.2 Residual connections and residual blocks
Residual or skip connections are branches in the computational path, whereby the input
to each network layer f[•] is added back to the output (figure 11.4a). By analogy to
equation 11.1, the residual network is defined as:

h_1 = x + f_1[x, ϕ_1]
h_2 = h_1 + f_2[h_1, ϕ_2]
h_3 = h_2 + f_3[h_2, ϕ_3]
y = h_3 + f_4[h_3, ϕ_4],     (11.4)

where the first term on the right-hand side of each line is the residual connection. Each
function f_k learns an additive change to the current representation. It follows that their
outputs must be the same size as their inputs. Each additive combination of the input
and the processed output is known as a residual block or residual layer.
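Equation 11.4 is a short loop in code. One sanity check, sketched below, is that if every residual branch outputs zero, the whole network computes the identity:

```python
import numpy as np

def residual_net(x, branches):
    """Each function adds a change to the current representation (equation 11.4)."""
    h = x
    for f in branches:
        h = h + f(h)
    return h

# With zero-output branches, the skip connections pass the input straight through.
x = np.array([1.0, -2.0, 3.0])
assert np.allclose(residual_net(x, [lambda h: 0.0 * h] * 4), x)
```
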
Once more, we can write this as a single function by substituting in the expressions
for the intermediate quantities h_k:
Problem 11.1

y = x + f_1[x]     (11.5)
    + f_2[ x + f_1[x] ]
    + f_3[ x + f_1[x] + f_2[ x + f_1[x] ] ]
    + f_4[ x + f_1[x] + f_2[ x + f_1[x] ] + f_3[ x + f_1[x] + f_2[ x + f_1[x] ] ] ],

where we have omitted the parameters ϕ for clarity. We can think of this equation as
“unraveling” the network (figure 11.4b). We see that the final network output is a sum
of the input and four smaller networks, corresponding to each line of the equation; one

¹In equations 11.3 and 11.6, we overload notation to define f_k as the output of the function f_k[•].
Figure 11.4 Residual connections. a) The output of each function f_k[x, ϕ_k] is
added back to its input, which is passed via a parallel computational path called
a residual or skip connection. Hence, the function computes an additive change
to the representation. b) Upon expanding (unraveling) the network equations, we
find that the output is the sum of the input plus four smaller networks (depicted
in white, orange, gray, and cyan, respectively, and corresponding to terms in
equation 11.5); we can think of this as an ensemble of networks. Moreover,
the output from the cyan network is itself a transformation f_4[•, ϕ_4] of another
ensemble, and so on. Alternatively, we can consider the network as a combination
of 16 different paths through the computational graph. One example is the dashed
path from input x to output y, which is the same in panels (a) and (b).
Figure 11.5 Order of operations in resid-
ual blocks. a) The usual order of linear
transformation or convolution followed
by a ReLU nonlinearity means that each
residual block can only add non-negative
quantities. b) With the reverse order,
both positive and negative quantities can
be added. However, we must add a linear
transformation at the start of the net-
work in case the input is all negative. c)
In practice, it’s common for a residual
block to contain several network layers.
interpretation is that residual connections turn the original network into an ensemble of
these smaller networks whose outputs are summed to compute the result.
A complementary way of thinking about this residual network is that it creates sixteen
paths of different lengths from input to output. For example, the first function f_1[x]
occurs in eight of these sixteen paths, including as a direct additive term (i.e., a path
length of one), and the analogous derivative to equation 11.3 is:
Problem 11.2
Problem 11.3

∂y/∂f_1 = I + ∂f_2/∂f_1 + ∂f_3/∂f_1 + (∂f_2/∂f_1)(∂f_3/∂f_2) + ∂f_4/∂f_1
          + (∂f_2/∂f_1)(∂f_4/∂f_2) + (∂f_3/∂f_1)(∂f_4/∂f_3) + (∂f_2/∂f_1)(∂f_3/∂f_2)(∂f_4/∂f_3),     (11.6)
where there is one term for each of the eight paths. The identity term on the right-
hand side shows that changes in the parameters ϕ_1 in the first layer f_1[x, ϕ_1] contribute
directly to changes in the network output y. They also contribute indirectly through
the other chains of derivatives of varying lengths. In general, gradients through shorter
paths will be better behaved. Since both the identity term and various short chains of
derivatives will contribute to the derivative for each layer, networks with residual links
suffer less from shattered gradients.
Notebook 11.2 Residual networks
11.2.1 Order of operations in residual blocks
Until now, we have implied that the additive functions f[x] could be any valid network
layer (e.g., fully connected or convolutional). This is technically true, but the order of
operations in these functions is important. They must contain a nonlinear activation
function like a ReLU, or the entire network will be linear. However, in a typical network
layer (figure 11.5a), the ReLU function is at the end, so the output is non-negative. If
we adopt this convention, then each residual block can only increase the input values.
Hence, it is typical to change the order of operations so that the activation function is
applied first, followed by the linear transformation (figure 11.5b). Sometimes there may
be several layers of processing within the residual block (figure 11.5c), but these usually
terminate with a linear transformation. Finally, we note that when we start these blocks
with a ReLU operation, they will do nothing if the initial network input is negative since
the ReLU will clip the entire signal to zero. Hence, it’s typical to start the network with
a linear transformation rather than a residual block, as in figure 11.5b.
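The difference between the two orderings is easy to see in a sketch (hypothetical helper names; a single linear layer stands in for the whole block):

```python
import numpy as np

def post_activation_block(h, W, b):
    """Linear then ReLU (figure 11.5a): the added change is non-negative."""
    return h + np.maximum(0.0, W @ h + b)

def pre_activation_block(h, W, b):
    """ReLU then linear (figure 11.5b): the added change can take either sign."""
    return h + W @ np.maximum(0.0, h) + b

rng = np.random.default_rng(3)
W, b, h = rng.standard_normal((4, 4)), rng.standard_normal(4), rng.standard_normal(4)
assert np.all(post_activation_block(h, W, b) >= h)   # can only increase the input
```
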
11.2.2 Deeper networks with residual connections
Adding residual connections roughly doubles the depth of a network that can be practi-
cally trained before performance degrades. However, we would like to increase the depth
further. To understand why residual connections do not allow us to increase the depth
arbitrarily, we must consider how the variance of the activations changes during the
forward pass and how the gradient magnitudes change during the backward pass.
11.3 Exploding gradients in residual networks
In section 7.5, we saw that initializing the network parameters is critical. Without
careful initialization, the magnitudes of the intermediate values during the forward pass
of backpropagation can increase or decrease exponentially. Similarly, the gradients during
the backward pass can explode or vanish as we move backward through the network.
Hence, we initialize the network parameters so that the expected variance of the
activations (in the forward pass) and gradients (in the backward pass) remains the same
between layers. He initialization (section 7.5) achieves this for ReLU activations by
initializing the biases β to zero and choosing normally distributed weights with mean
zero and variance 2/D_h, where D_h is the number of hidden units in the previous layer.
Now consider a residual network. We do not have to worry about the intermediate
values or gradients vanishing with network depth since there exists a path whereby
each layer directly contributes to the network output (equation 11.5 and figure 11.4b).
However, even if we use He initialization within the residual block, the values in the
forward pass increase exponentially as we move through the network.
To see why, consider that we add the result of the processing in the residual block back
to the input. Each branch has some (uncorrelated) variability. Hence, the overall variance
increases when we recombine them. With ReLU activations and He initialization, the
expected variance is unchanged by the processing in each block. Consequently, when
we recombine with the input, the variance doubles (figure 11.6a), growing exponentially
with the number of residual blocks. This limits the possible network depth before floating
point precision is exceeded in the forward pass. A similar argument applies to the
gradients in the backward pass of the backpropagation algorithm.
Problem 11.4
Hence, residual networks still suer from unstable forward propagation and exploding
gradients even with He initialization. One approach that would stabilize the forward and
backward passes would be to use He initialization and then multiply the combined output
of each residual block by 1/
2 to compensate for the doubling (gure 11.6b). However,
it is more usual to use batch normalization.
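A small NumPy experiment illustrates both the doubling and the 1/√2 fix. We use a residual branch in the ReLU-then-linear order of figure 11.5b with He-initialized weights; this is a sketch under those assumptions, not the book's exact setup:

```python
import numpy as np

rng = np.random.default_rng(4)
D = 1000   # hidden units per layer

def residual_branch(h):
    """ReLU then linear (figure 11.5b order) with He-initialized weights."""
    W = rng.standard_normal((D, D)) * np.sqrt(2.0 / D)
    return W @ np.maximum(0.0, h)

h = rng.standard_normal(D)              # unit variance at the input
for _ in range(5):
    h = h + residual_branch(h)          # variance roughly doubles per block
grown = np.mean(h ** 2)                 # roughly 2**5 = 32 after five blocks

h = rng.standard_normal(D)
for _ in range(5):
    h = (h + residual_branch(h)) / np.sqrt(2.0)   # compensating rescale
stable = np.mean(h ** 2)                # stays roughly 1
print(grown, stable)
```
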
11.4 Batch normalization
Batch normalization or BatchNorm shifts and rescales each activation h so that its mean
and variance across the batch B become values that are learned during training. First,
the empirical mean m_h and standard deviation s_h are computed:
Figure 11.6 Variance in residual networks. a) He initialization ensures that the
expected variance remains unchanged after a linear plus ReLU layer f_k. Unfortu-
nately, in residual networks, the input of each block is added back to the output,
so the variance doubles at each layer (gray numbers indicate variance) and grows
exponentially. b) One approach would be to rescale the signal by 1/√2 between
each residual block. c) A second method uses batch normalization (BN) as the
first step in the residual block and initializes the associated offset δ to zero and
scale γ to one. This transforms the input to each layer to have unit variance, and
with He initialization, the output variance will also be one. Now the variance
increases linearly with the number of residual blocks. A side-effect is that, at
initialization, later network layers are dominated by the residual connection and
are hence close to computing the identity.
m_h = (1/|B|) Σ_{i∈B} h_i,     s_h = √( (1/|B|) Σ_{i∈B} (h_i − m_h)² ),     (11.7)
where all quantities are scalars. Then we use these statistics to standardize the batch
activations to have mean zero and unit variance:
Appendix C.2.4 Standardization
h_i ← (h_i − m_h)/(s_h + ϵ)     ∀ i ∈ B,     (11.8)

where ϵ is a small number that prevents division by zero if h_i is the same for every
member of the batch and s_h = 0.
Finally, the normalized variable is scaled by γ and shifted by δ:

h_i ← γ h_i + δ     ∀ i ∈ B.     (11.9)
After this operation, the activations have mean δ and standard deviation γ across all
members of the batch. Both of these quantities are learned during training.
Problem 11.5
Batch normalization is applied independently to each hidden unit. In a standard
neural network with K layers, each containing D hidden units, there would be KD
learned offsets δ and KD learned scales γ. In a convolutional network, the normalizing
statistics are computed over both the batch and the spatial position. If there were K
layers, each containing C channels, there would be KC offsets and KC scales. At test
time, we do not have a batch from which we can gather statistics. To resolve this, the
statistics m_h and s_h are calculated across the whole training dataset (rather than just a
batch) and frozen in the final network.
Problem 11.6
Notebook 11.3 BatchNorm
11.4.1 Costs and benets of batch normalization
Batch normalization makes the network invariant to rescaling the weights and biases that
contribute to each activation; if these are doubled, then the activations also double, the
estimated standard deviation s_h doubles, and the normalization in equation 11.8 com-
pensates for these changes. This happens separately for each hidden unit. Consequently,
there will be a large family of weights and biases that all produce the same effect. Batch
normalization also adds two parameters, γ and δ, at every hidden unit, which makes the
model somewhat larger. Hence, it both creates redundancy in the weight parameters and
adds extra parameters to compensate for that redundancy. This is obviously inefficient,
but batch normalization also provides several benefits.
Stable forward propagation: If we initialize the offsets δ to zero and the scales γ to one,
then each output activation will have unit variance. In a regular network, this ensures
the variance is stable during forward propagation at initialization. In a residual network,
the variance must still increase as we add a new source of variation to the input at each
layer. However, it will increase linearly with each residual block; the kth residual block
adds one unit of variance to the existing variance of k (figure 11.6c).
At initialization, this has the side-effect that later layers make a smaller change to
the overall variation than earlier ones. The network is effectively less deep at the start of
training since later layers are close to computing the identity. As training proceeds, the
network can increase the scales γ in later layers and can control its own effective depth.
Higher learning rates: Empirical studies and theory both show that batch normaliza-
tion makes the loss surface and its gradient change more smoothly (i.e., reduces shat-
tered gradients). This means we can use higher learning rates as the surface is more
predictable. We saw in section 9.2 that higher learning rates improve test performance.
Regularization: We also saw in chapter 9 that adding noise to the training process
can improve generalization. Batch normalization injects noise because the normaliza-
tion depends on the batch statistics. The activations for a given training example are
normalized by an amount that depends on the other members of the batch and will be
slightly dierent at each training iteration.
11.5 Common residual architectures
Residual connections are now a standard part of deep learning pipelines. This section
reviews some well-known architectures that incorporate them.
11.5.1 ResNet
Residual blocks were rst used in convolutional networks for image classication. The
resulting networks are known as residual networks, or
ResNets
for short. In ResNets, each
residual block contains a batch normalization operation, a ReLU activation function, and
a convolutional layer. This is followed by the same sequence again before being added
Problem 11.7
back to the input (gure 11.7a). Trial and error have shown that this order of operations
works well for image classication.
For very deep networks, the number of parameters may become undesirably large.
Bottleneck residual blocks make more efficient use of parameters using three convolutions.
The first has a 1×1 kernel and reduces the number of channels. The second is a regular
3×3 kernel, and the third is another 1×1 kernel to increase the number of channels back
to the original amount (figure 11.7b). In this way, we can integrate information over a
3×3 pixel area using fewer parameters.
Problem 11.8
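The parameter saving is easy to quantify. Assuming, for illustration, 256 channels reduced to 64 inside the bottleneck (representative numbers, not taken from the text):

```python
def conv_params(c_in, c_out, k):
    """Weights plus biases for a k×k convolution (spatial size is irrelevant)."""
    return k * k * c_in * c_out + c_out

# Standard block (figure 11.7a): two 3x3 convolutions at 256 channels.
standard = 2 * conv_params(256, 256, 3)

# Bottleneck block (figure 11.7b): 1x1 down to 64, 3x3 at 64, 1x1 back to 256.
bottleneck = (conv_params(256, 64, 1)
              + conv_params(64, 64, 3)
              + conv_params(64, 256, 1))

print(standard, bottleneck)   # the bottleneck uses far fewer parameters
```
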
The ResNet-200 model (gure 11.8) contains 200 layers and was used for image clas-
sication on the ImageNet database (gure 10.15). The architecture resembles AlexNet
and VGG but uses bottleneck residual blocks instead of vanilla convolutional layers. As
with AlexNet and VGG, these are periodically interspersed with decreases in spatial
resolution and simultaneous increases in the number of channels. Here, the resolution is
decreased by downsampling using convolutions with stride two. The number of channels
is increased either by appending zeros to the representation or by using an extra 1×1
convolution. At the start of the network is a 7×7 convolutional layer, followed by a
downsampling operation. At the end, a fully connected layer maps the block to a vector
of length 1000. This is passed through a softmax layer to generate class probabilities.
The ResNet-200 model achieved a remarkable 4.8% error rate for the correct class
being in the top five and 20.1% for identifying the class correctly. This compared
favorably with AlexNet (16.4%, 38.1%) and VGG (6.8%, 23.7%) and was one of the
first networks to exceed human performance (5.1% for being in the top five guesses).
However, this model was conceived in 2016 and is far from state-of-the-art. At the time
of writing, the best-performing model on this task has a 9.0% error for identifying the
class correctly (see figure 10.21). This and all the other current top-performing models
for image classification are now based on transformers (see chapter 12).
11.5.2 DenseNet
Residual blocks receive the output from the previous layer, modify it by passing it
through some network layers, and add it back to the original input. An alternative is
to concatenate the modified and original signals. This increases the representation size
Draft: please send errata to udlbookmail@gmail.com.
Figure 11.7 ResNet blocks. a) A standard block in the ResNet architecture con-
tains a batch normalization operation, followed by an activation function, and
a 3×3 convolutional layer. Then, this sequence is repeated. b) A bottleneck
ResNet block still integrates information over a 3×3 region but uses fewer pa-
rameters. It contains three convolutions. The first 1×1 convolution reduces the
number of channels. The second 3×3 convolution is applied to the smaller rep-
resentation. A final 1×1 convolution increases the number of channels again so
that it can be added back to the input.
Figure 11.8 ResNet-200 model. A standard 7×7 convolutional layer with stride
two is applied, followed by a MaxPool operation. A series of bottleneck residual
blocks follow (number in brackets is channels after the first 1×1 convolution), with
periodic downsampling and accompanying increases in the number of channels.
The network concludes with average pooling across all spatial positions and a
fully connected layer that maps to pre-softmax activations.
This work is subject to a Creative Commons CC-BY-NC-ND license. (C) MIT Press.
Figure 11.9 DenseNet. This architecture uses residual connections to concatenate
the outputs of earlier layers to later ones. Here, the three-channel input image is
processed to form a 32-channel representation. The input image is concatenated
to this to give a total of 35 channels. This combined representation is processed
to create another 32-channel representation, and both earlier representations are
concatenated to this to create a total of 67 channels and so on.
(in terms of channels for a convolutional network), but an optional subsequent linear
transformation can map back to the original size (a 1×1 convolution for a convolutional
network). This allows the model to add the representations together, take a weighted
sum, or combine them in a more complex way.
The DenseNet architecture uses concatenation so that the input to a layer comprises
the concatenated outputs from all previous layers (figure 11.9). These are processed to
create a new representation that is itself concatenated with the previous representation
and passed to the next layer. This concatenation means there is a direct contribution
from earlier layers to the output, so the loss surface behaves reasonably.
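The channel growth in figure 11.9 is easy to reproduce. In this minimal NumPy sketch, a random linear map over channels stands in for each learned convolutional layer; the concatenated representation grows by 32 channels per layer, exactly as in the figure:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 16, 16))   # 3-channel input "image"

channels = [x.shape[0]]
for _ in range(3):
    # Stand-in for a learned layer: map all current channels to 32 new ones.
    w = rng.standard_normal((32, x.shape[0]))
    new = np.einsum('oc,chw->ohw', w, x)
    # DenseNet: concatenate the new output onto everything computed so far.
    x = np.concatenate([x, new], axis=0)
    channels.append(x.shape[0])

print(channels)   # [3, 35, 67, 99] -- grows by 32 channels per layer
```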
In practice, this can only be sustained for a few layers because the number of channels
(and hence the number of parameters required to process them) becomes increasingly
large. This problem can be alleviated by applying a 1×1 convolution to reduce the
number of channels before the next 3×3 convolution is applied. In a convolutional
network, the input is periodically downsampled. Concatenation across the downsampling
makes no sense since the representations have different sizes. Consequently, the chain of
concatenation is broken at this point, and a smaller representation starts a new chain.
In addition, another bottleneck 1×1 convolution can be applied when the downsampling
occurs to control the representation size further.
This network performs competitively with ResNet models on image classification (see
figure 10.21); indeed, it can perform better for a comparable parameter count. This is
presumably because it can reuse processing from earlier layers more flexibly.
11.5.3 U-Nets and hourglass networks
Section 10.5.3 described a semantic segmentation network that had an encoder-decoder or
hourglass structure. The encoder repeatedly downsamples the image until the receptive
fields are large and information is integrated from across the image. Then the decoder
Figure 11.10 U-Net for segmenting HeLa cells. The U-Net has an encoder-decoder
structure, in which the representation is downsampled (orange blocks) and then
re-upsampled (blue blocks). The encoder uses regular convolutions, and the
decoder uses transposed convolutions. Residual connections append the last
representation at each scale in the encoder to the first representation at the same
scale in the decoder (orange arrows). The original U-Net used “valid” convolutions,
so the size decreased slightly with each layer, even without downsampling. Hence,
the representations from the encoder were cropped (dashed squares) before
appending to the decoder. Adapted from Ronneberger et al. (2015).
upsamples it back to the size of the original image. The final output is a probability
over possible object classes at each pixel. One drawback of this architecture is that
the low-resolution representation in the middle of the network must “remember” the
high-resolution details to make the final result accurate. This is unnecessary if residual
connections transfer the representations from the encoder to their partner in the decoder.
The U-Net (gure 11.10) is an encoder-decoder architecture where the earlier repre-
sentations are concatenated to the later ones. The original implementation used “valid”
convolutions, so the spatial size decreases by two pixels each time a 3×3 convolutional
layer is applied. This means that the upsampled version is smaller than its counterpart
in the encoder, which must be cropped before concatenation. Subsequent implementa-
tions have used zero padding, where this cropping is unnecessary. Note that the U-Net
is completely convolutional, so after training, it can be run on an image of any size.
Problem 11.9
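The effect of “valid” convolutions on the spatial size can be checked with simple arithmetic. Assuming the configuration of the original U-Net paper (two 3×3 valid convolutions per scale, four 2× downsamplings, and a matching decoder), the sizes reproduce the well-known shrinkage from a 572-pixel input to a 388-pixel output:

```python
# Track spatial size through a U-Net with "valid" 3x3 convolutions
# (each convolution shrinks the representation by 2 pixels).
def unet_sizes(size=572, depth=4):
    encoder = []                 # size at each encoder scale, before pooling
    for _ in range(depth):
        size = size - 2 - 2      # two valid 3x3 convolutions
        encoder.append(size)
        size = size // 2         # 2x2 max pool
    size = size - 2 - 2          # two convolutions at the bottleneck
    for skip in reversed(encoder):
        size = size * 2          # transposed convolution upsamples by 2
        # Skip connection: the encoder map (size `skip`) must be cropped
        # down to `size` before concatenation.
        size = size - 2 - 2      # two more valid convolutions
    return size

print(unet_sizes(572))   # 388, matching Ronneberger et al. (2015)
```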
The U-Net was intended for segmenting medical images (figure 11.11) but has found
many other uses in computer graphics and vision. Hourglass networks are similar but
apply further convolutional layers in the skip connections and add the result back to the
decoder rather than concatenating it. A series of these models form a stacked hourglass
network that alternates between considering the image at local and global levels. Such
networks are used for pose estimation (figure 11.12). The system is trained to predict one
“heatmap” for each joint, and the estimated position is the maximum of each heatmap.
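Extracting a joint position from a predicted heatmap is just an argmax over spatial positions. A small NumPy sketch with a hypothetical Gaussian-shaped heatmap:

```python
import numpy as np

# A hypothetical 64x64 heatmap for one joint, peaked at row 20, column 45.
ys, xs = np.mgrid[0:64, 0:64]
heatmap = np.exp(-((ys - 20) ** 2 + (xs - 45) ** 2) / 50.0)

# The estimated joint position is the location of the heatmap maximum.
row, col = np.unravel_index(np.argmax(heatmap), heatmap.shape)
print(row, col)   # 20 45
```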
Figure 11.11 Segmentation using U-Net in 3D. a) Three slices through a 3D
volume of mouse cortex taken by scanning electron microscope. b) A single U-
Net is used to classify voxels as being inside or outside neurites. Connected
regions are identified with different colors. c) For a better result, an ensemble of
five U-Nets is trained, and a voxel is only classified as belonging to the cell if all
five networks agree. Adapted from Falk et al. (2019).
11.6 Why do nets with residual connections perform so well?
Residual networks allow much deeper networks to be trained; it’s possible to extend the
ResNet architecture to 1000 layers and still train effectively. The improvement in image
classification performance was initially attributed to the additional network depth, but
two pieces of evidence contradict this viewpoint.
First, shallower, wider residual networks sometimes outperform deeper, narrower ones
with a comparable parameter count. In other words, better performance can sometimes
be achieved with a network with fewer layers but more channels per layer. Second, there
is evidence that the gradients during training do not propagate effectively through very
long paths in the unraveled network (figure 11.4b). In effect, a very deep network may
act more like a combination of shallower networks.
The current view is that residual connections add some value of their own, as well
as allowing deeper networks to be trained. This perspective is supported by the fact
that the loss surfaces of residual networks around a minimum tend to be smoother and
more predictable than those for the same network when the skip connections are removed
(figure 11.13). This may make it easier to learn a good solution that generalizes well.
11.7 Summary
Increasing network depth indenitely causes both training and test performance for image
classication to decrease. This may be because the gradient of the loss with respect to
Figure 11.12 Stacked hourglass networks for pose estimation. a) The network
input is an image containing a person, and the output is a set of heatmaps, with
one heatmap for each joint. This is formulated as a regression problem where the
targets are heatmap images with small, highlighted regions at the ground-truth
joint positions. The peak of the estimated heatmap is used to establish each final
joint position. b) The architecture consists of initial convolutional and residual
layers followed by a series of hourglass blocks. c) Each hourglass block consists
of an encoder-decoder network similar to the U-Net except that the convolutions
use zero padding, some further processing is done in the residual links, and these
links add this processed representation rather than concatenate it. Each blue
cuboid is itself a bottleneck residual block (figure 11.7b). Adapted from Newell
et al. (2016).
Figure 11.13 Visualizing neural network loss surfaces. Each plot shows the loss
surface in two random directions in parameter space around the minimum found
by SGD for an image classification task on the CIFAR-10 dataset. These
directions are normalized to facilitate side-by-side comparison. a) Residual net with 56
layers. b) Results from the same network without skip connections. The surface
is smoother with the skip connections. This facilitates learning and makes the
final network performance more robust to minor errors in the parameters, so it
will likely generalize better. Adapted from Li et al. (2018b).
parameters early in the network changes quickly and unpredictably relative to the update
step size. Residual connections add the processed representation back to their own input.
Now each layer contributes directly to the output as well as indirectly, so propagating
gradients through many layers is not mandatory, and the loss surface is smoother.
Residual networks don’t suer from vanishing gradients but introduce an exponential
increase in the variance of the activations during forward propagation and corresponding
problems with exploding gradients. This is usually handled by adding batch normaliza-
tion, which compensates for the empirical mean and variance of the batch and then
shifts and rescales using learned parameters. If these parameters are initialized judi-
ciously, very deep networks can be trained. There is evidence that both residual links
and batch normalization make the loss surface smoother, which permits larger learning
rates. Moreover, the variability in the batch statistics adds a source of regularization.
Residual blocks have been incorporated into convolutional networks. They allow
deeper networks to be trained with commensurate increases in image classification
performance. Variations of residual networks include the DenseNet architecture, which
concatenates outputs of all prior layers to feed into the current layer, and U-Nets, which
incorporate residual connections into encoder-decoder models.
Notes
Residual connections: Residual connections were introduced by He et al. (2016a), who built
a network with 152 layers, which was eight times larger than VGG (figure 10.17), and achieved
state-of-the-art performance on the ImageNet classification task. Each residual block consisted
of a convolutional layer followed by batch normalization, a ReLU activation, a second
convolutional layer, and a second batch normalization. A second ReLU function was applied after this
block was added back to the main representation. This architecture was termed ResNet v1.
He et al. (2016b) investigated different variations of residual architectures, in which either (i)
processing could also be applied along the skip connection or (ii) after the two branches had
recombined. They concluded neither was necessary, leading to the architecture in figure 11.7,
which is sometimes termed a pre-activation residual block and is the backbone of ResNet v2.
They trained a network with 200 layers that improved further on the ImageNet classification
task (see figure 11.8). Since this time, new methods for regularization, optimization, and data
augmentation have been developed, and Wightman et al. (2021) exploit these to present a more
modern training pipeline for the ResNet architecture.
Why residual connections help: Residual networks certainly allow deeper networks to be
trained. Presumably, this is related to reducing shattered gradients (Balduzzi et al., 2017) at
the start of training and the smoother loss surface near the minima as depicted in figure 11.13
(Li et al., 2018b). Residual connections alone (i.e., without batch normalization) increase the
trainable depth of a network by roughly a factor of two (Sankararaman et al., 2020). With batch
normalization, very deep networks can be trained, but it is unclear that depth is critical for
performance. Zagoruyko & Komodakis (2016) showed that wide residual networks with only 16
layers outperformed all residual networks of the time for image classification. Orhan & Pitkow
(2017) propose a different explanation for why residual connections improve learning in terms
of eliminating singularities (places on the loss surface where the Hessian is degenerate).
Related architectures: Residual connections are a special case of highway networks
(Srivastava et al., 2015), which also split the computation into two branches and additively recombine.
Highway networks use a gating function that weights the inputs to the two branches in a way
that depends on the data itself, whereas residual networks send the data down both branches in
a straightforward manner. Xie et al. (2017) introduced the ResNeXt architecture, which places
a residual connection around multiple parallel convolutional branches.
Residual networks as ensembles: Veit et al. (2016) characterized residual networks as
ensembles of shorter networks and depicted the “unraveled network” interpretation (figure 11.4b).
They provide evidence that this interpretation is valid by showing that deleting layers in a
trained network (and hence a subset of paths) only has a modest effect on performance.
Conversely, removing a layer in a purely sequential network like VGG is catastrophic. They also
looked at the gradient magnitudes along paths of different lengths and showed that the gradient
vanishes in longer paths. In a residual network consisting of 54 blocks, almost all of the gradient
updates during training were from paths of length 5 to 17 blocks long, even though these only
constitute 0.45% of the total paths. It seems that adding more blocks effectively adds more
parallel shorter paths rather than creating a network that is truly deeper.
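The fraction quoted above follows directly from counting paths: in the unraveled view, each of the K blocks is either taken or skipped, so there are C(K, k) paths of length k and 2^K paths in total. For K = 54, the paths of length 5 to 17 are indeed a tiny minority:

```python
from math import comb

K = 54
total = 2 ** K                               # every subset of blocks is a path
mid = sum(comb(K, k) for k in range(5, 18))  # paths of length 5..17 inclusive
print(f"{mid / total:.4%}")                  # roughly 0.45% of all paths
```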
Regularization for residual networks: L2 regularization of the weights has a fundamentally
different effect in vanilla networks and residual networks without BatchNorm. In the former, it
encourages the output of the layer to be a constant function determined by the biases. In the
latter, it encourages the residual block to compute the identity plus a constant determined by
the biases.
Several regularization methods have been developed that are targeted specifically at residual
architectures. ResDrop (Yamada et al., 2016), stochastic depth (Huang et al., 2016), and
RandomDrop (Yamada et al., 2019) all regularize residual networks by randomly dropping
residual blocks during the training process. In the latter case, the propensity for dropping a block
is determined by a Bernoulli variable, whose parameter is linearly decreased during training. At
test time, the residual blocks are added back in with their expected probability. These methods
are effectively versions of dropout, in which all the hidden units in a block are simultaneously
dropped in concert. In the multiple paths view of residual networks (figure 11.4b), they simply
remove some of the paths at each training step. Wu et al. (2018b) developed BlockDrop, which
analyzes an existing network and decides which residual blocks to use at runtime with the goal
of improving the efficiency of inference.
Other regularization methods have been developed for networks with multiple paths inside
the residual block. Shake-shake (Gastaldi, 2017a,b) randomly re-weights the paths during the
forward and backward passes. In the forward pass, this can be viewed as synthesizing random
data, and in the backward pass, as injecting another form of noise into the training method.
ShakeDrop (Yamada et al., 2019) draws a Bernoulli variable that decides whether each block
will be subject to Shake-Shake or behave like a standard residual unit on this training step.
Batch normalization: Batch normalization was introduced by Ioffe & Szegedy (2015) outside
of the context of residual networks. They showed empirically that it allowed higher learning
rates, increased convergence speed, and made sigmoid activation functions more practical (since
the distribution of outputs is controlled, so examples are less likely to fall in the saturated
extremes of the sigmoid). Balduzzi et al. (2017) investigated the activation of hidden units in
later layers of deep networks with ReLU functions at initialization. They showed that many such
hidden units were always active or always inactive regardless of the input but that BatchNorm
reduced this tendency.
Although batch normalization helps stabilize the forward propagation of signals through a
network, Yang et al. (2019) showed that it causes gradient explosion in ReLU networks without
skip connections, with each layer increasing the magnitude of the gradients by √(π/(π−1)) ≈
1.21. This argument is summarized by Luther (2020). Since a residual network can be seen
as a combination of paths of different lengths (figure 11.4), this effect must also be present in
residual networks. Presumably, however, the benefit of removing the 2^K increase in magnitude
in the forward pass of a network with K layers outweighs the harm done by increasing the
gradients by 1.21^K in the backward pass, so overall BatchNorm makes training more stable.
Variations of batch normalization: Several variants of BatchNorm have been proposed
(figure 11.14). BatchNorm normalizes each channel separately based on statistics gathered
across the batch. Ghost batch normalization or GhostNorm (Hoffer et al., 2017) uses only part
of the batch to compute the normalization statistics, which makes them noisier and increases
the amount of regularization when the batch size is very large (figure 11.14b).
When the batch size is very small or the fluctuations within a batch are very large (as is often the
case in natural language processing), the statistics in BatchNorm may become unreliable. Ioffe
(2017) proposed batch renormalization, which keeps a running average of the batch statistics
and modifies the normalization of any batch to ensure that it is more representative. Another
problem is that batch normalization is unsuitable for use in recurrent neural networks (networks
for processing sequences, in which the previous output is fed back as an additional input as we
move through the sequence; see figure 12.19). Here, the statistics must be stored at each step in
the sequence, and it’s unclear what to do if a test sequence is longer than the training sequences.
A third problem is that batch normalization needs access to the whole batch. However, this
may not be easily available when training is distributed across several machines.
Layer normalization or LayerNorm (Ba et al., 2016) avoids using batch statistics by normalizing
each data example separately, using statistics gathered across the channels and spatial position
(figure 11.14c). However, there is still a separate learned scale γ and offset δ per channel.
Group normalization or GroupNorm (Wu & He, 2018) is similar to LayerNorm but divides the
channels into groups and computes the statistics for each group separately across the
within-group channels and the spatial positions (figure 11.14d). Again, there are still separate scale and
offset parameters per channel. Instance normalization or InstanceNorm (Ulyanov et al., 2016)
takes this to the extreme where the number of groups is the same as the number of channels,
so each channel is normalized separately (figure 11.14e), using statistics gathered across spatial
Figure 11.14 Normalization schemes. BatchNorm modifies each channel
separately but adjusts each batch member in the same way based on statistics
gathered across the batch and spatial position. Ghost BatchNorm computes these
statistics from only part of the batch to make them more variable. LayerNorm
computes statistics for each batch member separately, based on statistics
gathered across the channels and spatial position. It retains a separate learned scaling
factor for each channel. GroupNorm normalizes within each group of channels
and also retains a separate scale and offset parameter for each channel.
InstanceNorm normalizes within each channel separately, computing the statistics only
across spatial position. Adapted from Wu & He (2018).
position alone. Salimans & Kingma (2016) investigated normalizing the network weights rather
than the activations, but this has been less empirically successful. Teye et al. (2018) introduced
Monte Carlo batch normalization, which can provide meaningful estimates of uncertainty in the
predictions of neural networks. A recent comparison of the properties of different normalization
schemes can be found in Lubana et al. (2021).
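These schemes differ only in which axes of the (batch, channels, height, width) tensor the statistics are computed over. A NumPy sketch (omitting the learned scale γ and offset δ):

```python
import numpy as np

rng = np.random.default_rng(0)
x = 5 * rng.standard_normal((2, 8, 4, 4)) + 3   # (batch, channels, H, W)

def normalize(x, axes):
    mu = x.mean(axis=axes, keepdims=True)
    sigma = x.std(axis=axes, keepdims=True)
    return (x - mu) / sigma

batch_norm = normalize(x, (0, 2, 3))      # per channel, across the batch
layer_norm = normalize(x, (1, 2, 3))      # per example, across channels
instance_norm = normalize(x, (2, 3))      # per example AND per channel

# GroupNorm: split the 8 channels into 2 groups of 4, normalize within groups.
g = x.reshape(2, 2, 4, 4, 4)
group_norm = normalize(g, (2, 3, 4)).reshape(x.shape)

print(batch_norm.mean(axis=(0, 2, 3)))    # ~0 for every channel
```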
Why BatchNorm helps: BatchNorm helps control the initial gradients in a residual network
(figure 11.6c). However, the mechanism by which BatchNorm improves performance is not
well understood. The stated goal of Ioffe & Szegedy (2015) was to reduce problems caused
by internal covariate shift, which is the change in the distribution of inputs to a layer caused
by updating preceding layers during the backpropagation update. However, Santurkar et al.
(2018) provided evidence against this view by artificially inducing covariate shift and showing
that networks with and without BatchNorm performed equally well.
Motivated by this, they searched for another explanation for why BatchNorm should improve
performance. They showed empirically for the VGG network that adding batch normalization
decreases the variation in both the loss and its gradient as we move in the gradient direction.
In other words, the loss surface is both smoother and changes more slowly, which is why larger
learning rates are possible. They also provide theoretical proofs for both these phenomena
and show that for any parameter initialization, the distance to the nearest optimum is less for
networks with batch normalization.
Bjorck et al. (2018) also argue that BatchNorm improves
the properties of the loss landscape and allows larger learning rates.
Other explanations of why BatchNorm improves performance include decreasing the importance
of tuning the learning rate (Ioffe & Szegedy, 2015; Arora et al., 2018). Indeed, Li & Arora
(2019) show that using an exponentially increasing learning rate schedule is possible with batch
normalization. Ultimately, this is because batch normalization makes the network invariant to
the scales of the weight matrices (see Huszár, 2019, for an intuitive visualization).
Hoffer et al. (2017) identified that BatchNorm has a regularizing effect due to statistical
fluctuations from the random composition of the batch. They proposed using a ghost batch size,
in which the mean and standard deviation statistics are computed from a subset of the batch.
Large batches can now be used without losing the regularizing effect of the extra noise in smaller
batch sizes. Luo et al. (2018) investigate the regularization effects of batch normalization.
Alternatives to batch normalization: Although BatchNorm is widely used, it is not strictly
necessary to train deep residual nets; there are other ways of making the loss surface tractable.
Balduzzi et al. (2017) proposed the rescaling by √(1/2) in figure 11.6b; they argued that it
prevents gradient explosion but does not resolve the problem of shattered gradients.
Other work has investigated rescaling the function’s output in the residual block before adding
it back to the input. For example, De & Smith (2020) introduce SkipInit, in which a learnable
scalar multiplier is placed at the end of each residual branch. This helps if this multiplier is
initialized to less than √(1/K), where K is the number of residual blocks. In practice, they
suggest initializing this to zero. Similarly, Hayou et al. (2021) introduce Stable ResNet, which
rescales the output of the function in the kth residual block (before addition to the main branch)
by a constant λ_k. They prove that in the limit of infinite width, the expected gradient norm of
the weights in the first layer is lower bounded by the sum of squares of the scalings λ_k. They
investigate setting these to a constant √(1/K), where K is the number of residual blocks, and
show that it is possible to train networks with up to 1000 blocks.
Zhang et al. (2019a) introduce FixUp, in which every layer is initialized using He normalization,
but the last linear/convolutional layer of every residual block is set to zero. Now the initial
forward pass is stable (since each residual block contributes nothing), and the gradients do not
explode in the backward pass (for the same reason). They also rescale the branches so that the
magnitude of the total expected change in the parameters is constant regardless of the number
of residual blocks. These methods allow training of deep residual networks but don’t usually
achieve the same test performance as when using BatchNorm. This is probably because they
do not benefit from the regularization induced by the noisy batch statistics. De & Smith (2020)
modify their method to induce regularization via dropout, which helps close this gap.
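The intuition behind zero-initialized branch scaling can be seen in a few lines: if each residual branch is multiplied by a factor initialized to zero, the network computes the identity at initialization, so activations cannot explode no matter how many blocks there are. This is a toy sketch of the idea, not the authors’ code:

```python
import numpy as np

rng = np.random.default_rng(0)
K, width = 100, 64

x = rng.standard_normal(width)
x0 = x.copy()

alphas = np.zeros(K)               # SkipInit-style: branch scalars start at 0
for k in range(K):
    W = rng.standard_normal((width, width)) / np.sqrt(width)
    branch = np.maximum(W @ x, 0)  # a ReLU layer as the residual branch
    x = x + alphas[k] * branch     # scaled branch added to the skip path

print(np.allclose(x, x0))   # True: 100 blocks, still the identity at init
```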
DenseNet and U-Net: DenseNet was first introduced by Huang et al. (2017b), U-Net was
developed by Ronneberger et al. (2015), and stacked hourglass networks by Newell et al. (2016).
Of these architectures, U-Net has been the most extensively adapted. Çiçek et al. (2016)
introduced 3D U-Net, and Milletari et al. (2016) introduced V-Net, both of which extend U-Net
to process 3D data. Zhou et al. (2018) combine the ideas of DenseNet and U-Net in an
architecture that downsamples and re-upsamples the image but also repeatedly uses intermediate
representations. U-Nets are commonly used in medical image segmentation (see Siddique et al.,
2021, for a review). However, they have been applied to other areas, including depth estimation
(Garg et al., 2016), semantic segmentation (Iglovikov & Shvets, 2018), inpainting (Zeng et al.,
2019), pansharpening (Yao et al., 2018), and image-to-image translation (Isola et al., 2017).
U-Nets are also a key component in diffusion models (chapter 18).
Problems
Problem 11.1 Derive equation 11.5 from the network definition in equation 11.4.
Problem 11.2 Unraveling the four-block network in figure 11.4a produces one path of length
zero, four paths of length one, six paths of length two, four paths of length three, and one path
of length four. How many paths of each length would there be if there were (i) three residual
blocks and (ii) five residual blocks? Deduce the rule for K residual blocks.
Problem 11.3 Show that the derivative of the network in equation 11.5 with respect to the first
layer f1[x] is given by equation 11.6.
Figure 11.15 Computational graph for batch normalization (see problem 11.5).
Problem 11.4 Explain why the values in the two branches of the residual blocks in figure 11.6a
are uncorrelated. Show that the variance of the sum of uncorrelated variables is the sum of
their individual variances.
Problem 11.5 The forward pass for batch normalization given a batch of scalar values {z_i},
i = 1, ..., I, consists of the following operations (figure 11.15):

f_1 = E[z_i]
f_{2i} = z_i − f_1
f_{3i} = f_{2i}²
f_4 = E[f_{3i}]
f_5 = √(f_4 + ϵ)
f_6 = 1/f_5
f_{7i} = f_{2i} × f_6
z′_i = f_{7i} × γ + δ,        (11.10)

where E[z_i] = (1/I) Σ_i z_i. Write Python code to implement the forward pass. Now derive the
algorithm for the backward pass. Work backward through the computational graph computing
the derivatives to generate a set of operations that computes ∂z′_i/∂z_i for every element in the
batch. Write Python code to implement the backward pass.
Problem 11.6 Consider a fully connected neural network with one input, one output, and ten
hidden layers, each of which contains twenty hidden units. How many parameters does this
network have? How many parameters will it have if we place a batch normalization operation
between each linear transformation and ReLU?
Problem 11.7 Consider applying an L2 regularization penalty to the weights in the
convolutional layers in figure 11.7a, but not to the scaling parameters of the subsequent BatchNorm
layers. What do you expect will happen as training proceeds?
Problem 11.8 Consider a convolutional residual block that contains a batch normalization
operation, followed by a ReLU activation function, and then a 3×3 convolutional layer. If the input
and output both have 512 channels, how many parameters are needed to define this block? Now
consider a bottleneck residual block that contains three batch normalization/ReLU/convolution
sequences. The first uses a 1×1 convolution to reduce the number of channels from 512 to 128.
The second uses a 3×3 convolution with the same number of input and output channels. The
third uses a 1×1 convolution to increase the number of channels from 128 to 512 (see
figure 11.7b). How many parameters are needed to define this block?
Problem 11.9 The U-Net is completely convolutional and can be run with any sized image after
training. Why do we not train with a collection of arbitrarily-sized images?
Chapter 12
Transformers
Chapter 10 introduced convolutional networks, which are specialized for processing data
that lie on a regular grid. They are particularly suited to processing images, which have
a very large number of input variables, precluding the use of fully connected networks.
Each layer of a convolutional network employs parameter sharing so that local image
patches are processed similarly at every position in the image.
This chapter introduces transformers. These were initially targeted at natural lan-
guage processing (NLP) problems, where the network input is a series of high-dimensional
embeddings representing words or word fragments. Language datasets share some of the
characteristics of image data. The number of input variables can be very large, and the
statistics are similar at every position; it’s not sensible to re-learn the meaning of the
word dog at every possible position in a body of text. However, language datasets have
the complication that text sequences vary in length, and unlike images, there is no easy
way to resize them.
12.1 Processing text data
To motivate the transformer, consider the following passage:
The restaurant refused to serve me a ham sandwich because it only cooks vegetarian
food. In the end, they just gave me two slices of bread. Their ambiance was just as good
as the food and service.
The goal is to design a network to process this text into a representation suitable for
downstream tasks. For example, it might be used to classify the review as positive or
negative or to answer questions such as “Does the restaurant serve steak?”.
We can make three immediate observations. First, the encoded input can be surpris-
ingly large. In this case, each of the 37 words might be represented by an embedding
vector of length 1024, so the encoded input would be of length 37 × 1024 = 37888 even
for this small passage. A more realistically sized body of text might have hundreds or
even thousands of words, so fully connected neural networks are impractical.
Draft: please send errata to udlbookmail@gmail.com.
Second, one of the defining characteristics of NLP problems is that each input (one or more sentences) is of a different length; hence, it's not even obvious how to apply a fully connected network. These observations suggest that the network should share parameters across words at different input positions, similarly to how convolutional networks share parameters across different image positions.
Third, language is ambiguous; it is unclear from the syntax alone that the pronoun it
refers to the restaurant and not to the ham sandwich. To understand the text, the word
it should somehow be connected to the word restaurant. In the parlance of transformers,
the former word should pay attention to the latter. This implies that there must be
connections between the words and that the strength of these connections will depend
on the words themselves. Moreover, these connections need to extend across large text
spans. For example, the word their in the last sentence also refers to the restaurant.
12.2 Dot-product self-attention
The previous section argued that a model for processing text will (i) use parameter sharing to cope with long input passages of differing lengths and (ii) contain connections between word representations that depend on the words themselves. The transformer acquires both properties by using dot-product self-attention.
A standard neural network layer f[x] takes a D × 1 input x and applies a linear transformation followed by an activation function like a ReLU, so:

f[x] = ReLU[β + Ωx],   (12.1)

where β contains the biases and Ω contains the weights.

A self-attention block sa[•] takes N inputs x_1, ..., x_N, each of dimension D × 1, and returns N output vectors of the same size. In the context of NLP, each input represents a word or word fragment. First, a set of values is computed for each input:

v_m = β_v + Ω_v x_m,   (12.2)

where β_v ∈ R^{D×1} and Ω_v ∈ R^{D×D} represent biases and weights, respectively.

Then the n-th output sa_n[x_1, ..., x_N] is a weighted sum of all the values v_1, ..., v_N:

sa_n[x_1, ..., x_N] = Σ_{m=1}^{N} a[x_m, x_n] v_m.   (12.3)

The scalar weight a[x_m, x_n] is the attention that the n-th output pays to input x_m. The N weights a[•, x_n] are non-negative and sum to one. Hence, self-attention can be thought of as routing the values in different proportions to create each output (figure 12.1).

The following sections examine dot-product self-attention in more detail. First, we consider the computation of the values and their subsequent weighting (equation 12.3). Then we describe how to compute the attention weights a[x_m, x_n] themselves.
Figure 12.1 Self-attention as routing. The self-attention mechanism takes N inputs x_1, ..., x_N ∈ R^D (here N = 3 and D = 4) and processes each separately to compute N value vectors. The n-th output sa_n[x_1, ..., x_N] (written as sa_n[x_•] for short) is then computed as a weighted sum of the N value vectors, where the weights are positive and sum to one. a) Output sa_1[x_•] is computed as a[x_1, x_1] = 0.1 times the first value vector, a[x_2, x_1] = 0.3 times the second value vector, and a[x_3, x_1] = 0.6 times the third value vector. b) Output sa_2[x_•] is computed in the same way, but this time with weights of 0.5, 0.2, and 0.3. c) The weighting for output sa_3[x_•] is different again. Each output can hence be thought of as a different routing of the N values.
12.2.1 Computing and weighting values
Equation 12.2 shows that the same weights Ω_v ∈ R^{D×D} and biases β_v ∈ R^D are applied to each input x_• ∈ R^D. This computation scales linearly with the sequence length N, so it requires fewer parameters than a fully connected network relating all DN inputs to all DN outputs. The value computation can be viewed as a sparse matrix operation with shared parameters (figure 12.2b).
The attention weights a[x_m, x_n] combine the values from different inputs. They are also sparse since there is only one weight for each ordered pair of inputs (x_m, x_n), regardless of the size of these inputs (figure 12.2c). It follows that the number of attention weights has a quadratic dependence on the sequence length N but is independent of the length D of each input (problem 12.1).
12.2.2 Computing attention weights
In the previous section, we saw that the outputs result from two chained linear transformations; the value vectors β_v + Ω_v x_m are computed independently for each input x_m, and these vectors are combined linearly by the attention weights a[x_m, x_n]. However, the overall self-attention computation is nonlinear. As we'll see shortly, the attention weights are themselves nonlinear functions of the input. This is an example of a hypernetwork, where one network branch computes the weights of another.
Figure 12.2 Self-attention for N = 3 inputs x_n, each with dimension D = 4. a) Each input x_n is operated on independently by the same weights Ω_v (same color equals same weight) and biases β_v (not shown) to form the values β_v + Ω_v x_n. Each output is a linear combination of the values, with a shared attention weight a[x_m, x_n] defining the contribution of the m-th value to the n-th output. b) Matrix showing block sparsity of the linear transformation Ω_v between inputs and values. c) Matrix showing sparsity of the attention weights relating values and outputs.
To compute the attention, we apply two more linear transformations to the inputs:

q_n = β_q + Ω_q x_n
k_m = β_k + Ω_k x_m,   (12.4)

where {q_n} and {k_m} are termed queries and keys, respectively. Then we compute dot products between the queries and keys and pass the results through a softmax function (appendix B.3.4):

a[x_m, x_n] = softmax_m[k_•^T q_n] = exp[k_m^T q_n] / Σ_{m'=1}^{N} exp[k_{m'}^T q_n],   (12.5)

so, for each x_n, they are positive and sum to one (figure 12.3). For obvious reasons, this is known as dot-product self-attention.
The names “queries” and “keys” were inherited from the field of information retrieval and have the following interpretation: the dot product operation returns a measure of similarity between its inputs, so the weights a[x_•, x_n] depend on the relative similarities between the n-th query and all of the keys. The softmax function means that the key vectors “compete” with one another to contribute to the final result. The queries and keys must have the same dimensions. However, these can differ from the dimension of the values, which is usually the same size as the input, so the representation doesn't change size (problem 12.2).

Figure 12.3 Computing attention weights. a) Query vectors q_n = β_q + Ω_q x_n and key vectors k_n = β_k + Ω_k x_n are computed for each input x_n. b) The dot products between each query and the three keys are passed through a softmax function to form non-negative attentions that sum to one. c) These route the value vectors (figure 12.1) via the sparse matrix from figure 12.2c.
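Putting equations 12.2–12.5 together gives the complete mechanism. The following is a minimal NumPy sketch (unscaled, single head; the function and parameter names are illustrative, not the book's):

```python
import numpy as np

def self_attention(X, beta_q, omega_q, beta_k, omega_k, beta_v, omega_v):
    """Dot-product self-attention (equations 12.2-12.5) for a D x N input X."""
    Q = beta_q + omega_q @ X                 # queries q_n as columns
    K = beta_k + omega_k @ X                 # keys k_m as columns
    V = beta_v + omega_v @ X                 # values v_m as columns
    logits = K.T @ Q                         # entry [m, n] holds k_m^T q_n
    logits = logits - logits.max(axis=0)     # stabilize the softmax numerically
    A = np.exp(logits) / np.exp(logits).sum(axis=0)  # softmax over m for each n
    return V @ A                             # each output is a weighted sum of values

D, N = 4, 3
rng = np.random.default_rng(1)
X = rng.standard_normal((D, N))
params = [rng.standard_normal(shape) for shape in [(D, 1), (D, D)] * 3]
out = self_attention(X, *params)
print(out.shape)  # (4, 3): N outputs of the same dimension as the inputs
```

Here the queries, keys, and values all keep dimension D for simplicity; as noted above, only the query and key dimensions need to match.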
12.2.3 Self-attention summary
The n-th output is a weighted sum of the same linear transformation v_• = β_v + Ω_v x_• applied to all of the inputs, where these attention weights are positive and sum to one. The weights depend on a measure of similarity between input x_n and the other inputs. There is no activation function, but the mechanism is nonlinear due to the dot-product and softmax operations used to compute the attention weights.
Note that this mechanism fulfills the initial requirements. First, there is a single shared set of parameters φ = {β_v, Ω_v, β_q, Ω_q, β_k, Ω_k}. This is independent of the number of inputs N, so the network can be applied to different sequence lengths. Second, there are connections between the inputs (words), and the strength of these connections depends on the inputs themselves via the attention weights.

Figure 12.4 Self-attention in matrix form. Self-attention can be implemented efficiently if we store the N input vectors x_n in the columns of the D × N matrix X. The input X is operated on separately by the query matrix Q, key matrix K, and value matrix V. The dot products are then computed using matrix multiplication, and a softmax operation is applied independently to each column of the resulting matrix to calculate the attentions. Finally, the values are post-multiplied by the attentions to create an output of the same size as the input.
12.2.4 Matrix form
The above computation can be written in a compact form if the N inputs x_n form the columns of the D × N matrix X. The values, queries, and keys can be computed as:

V[X] = β_v 1^T + Ω_v X
Q[X] = β_q 1^T + Ω_q X
K[X] = β_k 1^T + Ω_k X,   (12.6)

where 1 is an N × 1 vector containing ones. The self-attention computation is then:

Sa[X] = V[X] · Softmax[K[X]^T Q[X]],   (12.7)

where the function Softmax[•] takes a matrix and performs the softmax operation independently on each of its columns (figure 12.4). In this formulation, we have explicitly included the dependence of the values, queries, and keys on the input X to emphasize that self-attention computes a kind of triple product based on the inputs (notebook 12.1). However, from now on, we will drop this dependence and just write:

Sa[X] = V · Softmax[K^T Q].   (12.8)

Figure 12.5 Positional encodings. The self-attention architecture is equivariant to permutations of the inputs. To ensure that inputs at different positions are treated differently, a positional encoding matrix Π can be added to the data matrix. Each column is different, so the positions can be distinguished. Here, the position encodings use a predefined procedural sinusoidal pattern (which can be extended to larger values of N if necessary). However, in other cases, they are learned.
12.3 Extensions to dot-product self-attention
In the previous section, we described self-attention. Here, we introduce three extensions
that are almost always used in practice.
12.3.1 Positional encoding
Observant readers will have noticed that the self-attention mechanism discards important information: the computation is the same regardless of the order of the inputs x_n (problem 12.3). More precisely, it is equivariant with respect to permutations of the inputs. However, order is important when the inputs correspond to the words in a sentence. The sentence The woman ate the raccoon has a different meaning than The raccoon ate the woman. There are two main approaches to incorporating position information.

Absolute positional encodings: A matrix Π that encodes positional information is added to the input X (figure 12.5). Each column of Π is unique and hence contains information about the absolute position in the input sequence. This matrix can be chosen by hand or learned. It may be added to the network inputs or at every network layer. Sometimes it is added to X in the computation of the queries and keys but not to the values.
Relative positional encodings: The input to a self-attention mechanism may be an entire sentence, many sentences, or just a fragment of a sentence, and the absolute position of a word is much less important than the relative position between two inputs. Of course, this can be recovered if the system knows the absolute position of both, but relative positional encodings encode this information directly. Each element of the attention matrix corresponds to a particular offset between key position a and query position b. Relative positional encodings learn a parameter π_{a,b} for each offset and use this to modify the attention matrix by adding these values, multiplying by them, or using them to alter the attention matrix in some other way.
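Figure 12.5 shows a predefined sinusoidal pattern. One common construction of such a matrix Π is sketched below (the exact pattern used in the figure may differ; the function name is illustrative):

```python
import numpy as np

def positional_encoding(D, N):
    """Sinusoidal positional encoding matrix Pi (D x N), one column per position.
    Each row pair oscillates at a different frequency, so every column is
    distinct and the pattern extends naturally to larger N."""
    Pi = np.zeros((D, N))
    pos = np.arange(N)
    for i in range(0, D, 2):
        freq = 1.0 / (10000 ** (i / D))
        Pi[i, :] = np.sin(pos * freq)
        if i + 1 < D:
            Pi[i + 1, :] = np.cos(pos * freq)
    return Pi

Pi = positional_encoding(8, 5)
# every column (position) is distinct, so positions can be told apart
print(np.allclose(Pi[:, 0], Pi[:, 1]))  # False
```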
12.3.2 Scaled dot-product self-attention
The dot products in the attention computation can have large magnitudes and move the arguments to the softmax function into a region where the largest value completely dominates. Small changes to the inputs to the softmax function now have little effect on the output (i.e., the gradients are very small), making the model difficult to train (problem 12.4). To prevent this, the dot products are scaled by the square root of the dimension D_q of the queries and keys (i.e., the number of rows in Ω_q and Ω_k, which must be the same):

Sa[X] = V · Softmax[K^T Q / √D_q].   (12.9)

This is known as scaled dot-product self-attention.
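A quick numerical illustration of why the scaling matters: for large D_q, the unscaled dot products have magnitudes that grow like √D_q, so the softmax tends to saturate on a single key. This sketch compares the two regimes (all names are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(2)
D_q = 512
K = rng.standard_normal((D_q, 4))        # four key vectors as columns
q = rng.standard_normal(D_q)             # a single query

raw = K.T @ q                            # dot products of standard-normal vectors
a_raw = softmax(raw)                     # large-magnitude arguments: tends to saturate
a_scaled = softmax(raw / np.sqrt(D_q))   # equation 12.9 keeps the arguments O(1)

print(a_raw.max(), a_scaled.max())       # scaling typically spreads the weights out
```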
12.3.3 Multiple heads
Multiple self-attention mechanisms are usually applied in parallel, and this is known as multi-head self-attention. Now H different sets of values, keys, and queries are computed:

V_h = β_{vh} 1^T + Ω_{vh} X
Q_h = β_{qh} 1^T + Ω_{qh} X
K_h = β_{kh} 1^T + Ω_{kh} X.   (12.10)

The h-th self-attention mechanism or head can be written as:

Sa_h[X] = V_h · Softmax[K_h^T Q_h / √D_q],   (12.11)

where we have different parameters {β_{vh}, Ω_{vh}}, {β_{qh}, Ω_{qh}}, and {β_{kh}, Ω_{kh}} for each head. Typically, if the dimension of the inputs x_m is D and there are H heads, the values, queries, and keys will all be of size D/H, as this allows for an efficient implementation (problem 12.5). The outputs of these self-attention mechanisms are vertically concatenated, and another linear transformation Ω_c is applied to combine them (figure 12.6):
Figure 12.6 Multi-head self-attention. Self-attention occurs in parallel across multiple “heads.” Each has its own queries, keys, and values. Here, two heads are depicted, in the cyan and orange boxes, respectively. The outputs are vertically concatenated, and another linear transformation Ω_c is used to recombine them.
MhSa[X] = Ω_c [Sa_1[X]^T, Sa_2[X]^T, ..., Sa_H[X]^T]^T.   (12.12)
Multiple heads seem to be necessary to make self-attention work well. It has been speculated that they make the self-attention network more robust to bad initializations (notebook 12.2).
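Equations 12.10–12.12 can be sketched directly. This minimal NumPy version uses H = 2 heads of dimension D/H and scaled dot products as in equation 12.9 (the function and parameter names are illustrative):

```python
import numpy as np

def multihead_self_attention(X, heads, omega_c):
    """Multi-head self-attention (equations 12.10-12.12).
    `heads` is a list of per-head parameter dicts; omega_c recombines."""
    D_q = heads[0]["omega_q"].shape[0]            # per-head dimension
    outs = []
    for p in heads:
        Q = p["beta_q"] + p["omega_q"] @ X
        K = p["beta_k"] + p["omega_k"] @ X
        V = p["beta_v"] + p["omega_v"] @ X
        logits = (K.T @ Q) / np.sqrt(D_q)         # scaled dot products
        logits -= logits.max(axis=0)
        A = np.exp(logits) / np.exp(logits).sum(axis=0)
        outs.append(V @ A)
    return omega_c @ np.vstack(outs)              # concatenate vertically, recombine

D, H, N = 8, 2, 5                                 # each head works in dimension D/H = 4
rng = np.random.default_rng(3)
X = rng.standard_normal((D, N))
heads = [{k: rng.standard_normal((D // H, 1) if k.startswith("beta") else (D // H, D))
          for k in ["beta_q", "omega_q", "beta_k", "omega_k", "beta_v", "omega_v"]}
         for _ in range(H)]
omega_c = rng.standard_normal((D, D))
out = multihead_self_attention(X, heads, omega_c)
print(out.shape)  # (8, 5): the recombined output matches the input size
```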
12.4 Transformer layers
Figure 12.7 Transformer layer. The input consists of a D × N matrix containing the D-dimensional word embeddings for each of the N input tokens. The output is a matrix of the same size. The transformer layer consists of a series of operations. First, there is a multi-head attention block, allowing the word embeddings to interact with one another. This forms the processing of a residual block, so the inputs are added back to the output. Second, a LayerNorm operation is applied. Third, there is a second residual layer where the same fully connected neural network is applied separately to each of the N word representations (columns). Finally, LayerNorm is applied again.

Self-attention is just one part of a larger transformer layer. This consists of a multi-head self-attention unit (which allows the word representations to interact with each other) followed by a fully connected network mlp[x_•] (that operates separately on each word). Both units are residual networks (i.e., their output is added back to the original input). In addition, it is typical to add a LayerNorm operation after both the self-attention and fully connected networks. This is similar to BatchNorm but uses statistics across the tokens within a single input sequence to perform the normalization (section 11.4 and figure 11.14). The complete layer can be described by the following series of operations (figure 12.7):
X ← X + MhSa[X]
X ← LayerNorm[X]
x_n ← x_n + mlp[x_n]   ∀ n ∈ {1, ..., N}
X ← LayerNorm[X],   (12.13)

where the column vectors x_n are separately taken from the full data matrix X. In a real network, the data passes through a series of these transformer layers.
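The four operations of equation 12.13 can be sketched as follows. To keep the example short, the multi-head self-attention block is replaced by a stand-in mixing function, and the LayerNorm omits its learned scale and offset; all names are illustrative:

```python
import numpy as np

def layer_norm(X, eps=1e-5):
    """Normalize each column (token) to zero mean and unit variance;
    the learned scale and offset of a full LayerNorm are omitted here."""
    return (X - X.mean(axis=0)) / np.sqrt(X.var(axis=0) + eps)

def transformer_layer(X, mhsa, mlp):
    """The four operations of equation 12.13; `mhsa` and `mlp` are callables."""
    X = X + mhsa(X)             # residual multi-head self-attention block
    X = layer_norm(X)
    X = X + mlp(X)              # same fully connected net applied to every column
    return layer_norm(X)

D, N = 4, 3
rng = np.random.default_rng(4)
X = rng.standard_normal((D, N))
W1, W2 = rng.standard_normal((16, D)), rng.standard_normal((D, 16))
mlp = lambda X: W2 @ np.maximum(W1 @ X, 0)      # two-layer ReLU network per column
mhsa = lambda X: X.mean(axis=1, keepdims=True) * np.ones_like(X)  # stand-in mixer
Y = transformer_layer(X, mhsa, mlp)
print(Y.shape)  # (4, 3): same size as the input
```

In a real model, `mhsa` would be the multi-head mechanism of equation 12.12; the stand-in here only demonstrates the residual-plus-normalization wiring.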
12.5 Transformers for natural language processing
The previous section described the transformer layer. This section describes how it is
used in natural language processing (NLP) tasks. A typical NLP pipeline starts with a
tokenizer that splits the text into words or word fragments. Then each of these tokens
Figure 12.8 Sub-word tokenization. a) A passage of text from a nursery rhyme. The tokens are initially just the characters and whitespace (represented by an underscore), and their frequencies are displayed in the table. b) At each iteration, the sub-word tokenizer looks for the most commonly occurring adjacent pair of tokens (in this case, se) and merges them. This creates a new token and decreases the counts for the original tokens s and e. c) At the second iteration, the algorithm merges e and the whitespace character _. Note that the last character of the first token to be merged cannot be whitespace, which prevents merging across words. d) After 22 iterations, the tokens consist of a mix of letters, word fragments, and commonly occurring words. e) If we continue this process indefinitely, the tokens eventually represent the full words. f) Over time, the number of tokens increases as we add word fragments to the letters and then decreases again as we merge these fragments. In a real situation, there would be a very large number of words, and the algorithm would terminate when the vocabulary size (number of tokens) reached a predetermined value. Punctuation and capital letters would also be treated as separate input characters.
is mapped to a learned embedding. These embeddings are passed through a series of
transformer layers. We now consider each of these stages in turn.
12.5.1 Tokenization
A text processing pipeline begins with a tokenizer. This splits the text into smaller constituent units (tokens) from a vocabulary of possible tokens. In the discussion above, we have implied that these tokens represent words, but there are several difficulties:

• Inevitably, some words (e.g., names) will not be in the vocabulary.
• It's unclear how to handle punctuation, but this is important. If a sentence ends in a question mark, we must encode this information.
• The vocabulary would need different tokens for versions of the same word with different suffixes (e.g., walk, walks, walked, walking), and there is no way to clarify that these variations are related.

One approach would be to use letters and punctuation marks as the vocabulary, but this would mean splitting text into very small parts and requiring the subsequent network to re-learn the relations between them.

In practice, a compromise between letters and full words is used, and the final vocabulary includes both common words and word fragments from which larger and less frequent words can be composed. The vocabulary is computed using a sub-word tokenizer such as byte pair encoding (figure 12.8), which greedily merges commonly occurring sub-strings based on their frequency (notebook 12.3).
12.5.2 Embeddings
Each token in the vocabulary V is mapped to a unique word embedding, and the embeddings for the whole vocabulary are stored in a matrix Ω_e ∈ R^{D×|V|}. To accomplish this, the N input tokens are first encoded in the matrix T ∈ R^{|V|×N}, where the n-th column corresponds to the n-th token and is a |V| × 1 one-hot vector (i.e., a vector where every entry is zero except for the entry corresponding to the token, which is set to one). The input embeddings are computed as X = Ω_e T, and Ω_e is learned like any other network parameter (figure 12.9). A typical embedding size D is 1024, and a typical total vocabulary size |V| is 30,000, so even before the main network, there are many parameters in Ω_e to learn.
12.5.3 Transformer model
Finally, the embedding matrix X representing the text is passed through a series of K transformer layers, called a transformer model. There are three types of transformer models. An encoder transforms the text embeddings into a representation that can support a variety of tasks. A decoder predicts the next token to continue the input text. Encoder-decoders are used in sequence-to-sequence tasks, where one text string is converted into another (e.g., machine translation). These variations are described in sections 12.6–12.8, respectively.

Figure 12.9 The input embedding matrix X ∈ R^{D×N} contains N embeddings of length D and is created by multiplying a matrix Ω_e containing the embeddings for the entire vocabulary with a matrix containing one-hot vectors in its columns that correspond to the word or sub-word indices. The vocabulary matrix Ω_e is considered a parameter of the model and is learned along with the other parameters. Note that the two embeddings for the word an in X are the same.
12.6 Encoder model example: BERT
BERT is an encoder model that uses a vocabulary of 30,000 tokens. Input tokens are converted to 1024-dimensional word embeddings and passed through 24 transformer layers. Each contains a self-attention mechanism with 16 heads. The queries, keys, and values for each head are of dimension 64 (i.e., the matrices Ω_vh, Ω_qh, Ω_kh are 64 × 1024). The dimension of the single hidden layer in the fully connected networks is 4096. The total number of parameters is 340 million. When BERT was introduced, this was considered large, but it is now much smaller than state-of-the-art models.
Encoder models like BERT exploit transfer learning (section 9.3.6). During pre-training, the parameters of the transformer architecture are learned using self-supervision from a large corpus of text. The goal here is for the model to learn general information about the statistics of language. In the fine-tuning stage, the resulting network is adapted to solve a particular task using a smaller body of supervised training data.
Figure 12.10 Pre-training for BERT-like encoder. The input tokens (and a special <cls> token denoting the start of the sequence) are converted to word embeddings. Here, these are represented as rows rather than columns, so the box labeled “word embeddings” is X^T. These embeddings are passed through a series of transformer layers (orange connections indicate that every token attends to every other token in these layers) to create a set of output embeddings. A small fraction of the input tokens are randomly replaced with a generic <mask> token. In pre-training, the goal is to predict the missing word from the associated output embedding. As such, the output embeddings are passed through a softmax function, and the multiclass classification loss (section 5.24) is used. This task has the advantage that it uses both the left and right context to predict the missing word but has the disadvantage that it does not make efficient use of data; here, seven tokens need to be processed to add two terms to the loss function.
12.6.1 Pre-training
In the pre-training stage, the network is trained using self-supervision. This allows the use of enormous amounts of data without the need for manual labels. For BERT, the self-supervision task consists of predicting missing words from sentences from a large internet corpus (figure 12.10, problem 12.6).¹ During training, the maximum input length is 512 tokens, and the batch size is 256. The system is trained for a million steps, corresponding to roughly 50 epochs of the 3.3-billion word corpus.
Predicting missing words forces the transformer network to understand some syntax. For example, it might learn that the adjective red is often found before nouns like house or car but never before a verb like shout. It also allows the model to learn superficial common sense about the world. For example, after training, the model will assign a higher probability to the missing word train in the sentence The <mask> pulled into the station than it would to the word peanut. However, the degree of “understanding” this type of model can ever have is limited.

¹BERT also uses a secondary task that predicts whether two sentences were originally adjacent in the text or not, but this only marginally improves performance.
Figure 12.11 After pre-training, the encoder is fine-tuned using manually labeled data to solve a particular task. Usually, a linear transformation or a multi-layer perceptron (MLP) is appended to the encoder to produce whatever output is required. a) Example text classification task. In this sentiment classification task, the <cls> token embedding is used to predict the probability that the review is positive. b) Example word classification task. In this named entity recognition problem, the embedding for each word is used to predict whether the word corresponds to a person, place, or organization, or is not an entity.
12.6.2 Fine-tuning
In the ne-tuning stage, the model parameters are adjusted to specialize the network to
a particular task. An extra layer is appended onto the transformer network to convert
the output vectors to the desired output format. Examples include:
Text classication: In BERT, a special token known as the classication or <cls>
token is placed at the start of each string during pre-training. For text classication
tasks like sentiment analysis (in which the passage is labeled as having a positive or
negative emotional tone), the vector associated with the <cls> token is mapped to a
single number and passed through a logistic sigmoid (gure 12.11a). This contributes to
a standard binary cross-entropy loss (section 5.4).
Word classication: The goal of named entity recognition is to classify each word as
an entity type (e.g., person, place, organization, or no-entity). To this end, each input
embedding x
n
is mapped to an E × 1 vector where the E entries correspond to the E
entity types. This is passed through a softmax function to create probabilities for each
class, which contribute to a multiclass cross-entropy loss (gure 12.11b).
Text span prediction: In the SQuAD 1.1 question answering task, the question and a
passage from Wikipedia containing the answer are concatenated and tokenized. BERT
is then used to predict the text span in the passage that contains the answer. Each
token maps to two numbers indicating how likely it is that the text span begins and
ends at this location. The resulting two sets of numbers are put through two softmax
functions. The likelihood of any text span being the answer can be derived by combining
the probability of starting and ending at the appropriate places.
12.7 Decoder model example: GPT3
This section presents a high-level description of GPT3, an example of a decoder model. The basic architecture is extremely similar to the encoder model and comprises a series of transformer layers that operate on learned word embeddings. However, the goal is different. The encoder aimed to build a representation of the text that could be fine-tuned to solve a variety of more specific NLP tasks. Conversely, the decoder has one purpose: to generate the next token in a sequence. It can generate a coherent text passage by feeding the extended sequence back into the model.
12.7.1 Language modeling
GPT3 constructs an autoregressive language model. This is easiest to understand with a concrete example. Consider the sentence It takes great courage to let yourself appear weak. For simplicity, let's assume that the tokens are the full words. The probability of the full sentence is:

Pr(It takes great courage to let yourself appear weak) =
  Pr(It) × Pr(takes|It) × Pr(great|It takes) × Pr(courage|It takes great) ×
  Pr(to|It takes great courage) × Pr(let|It takes great courage to) ×
  Pr(yourself|It takes great courage to let) ×
  Pr(appear|It takes great courage to let yourself) ×
  Pr(weak|It takes great courage to let yourself appear).   (12.14)
More formally, an autoregressive model factors the joint probability Pr(t_1, t_2, ..., t_N) of the N observed tokens into an autoregressive sequence:

Pr(t_1, t_2, ..., t_N) = Pr(t_1) ∏_{n=2}^{N} Pr(t_n | t_1, ..., t_{n-1}).   (12.15)

The autoregressive formulation demonstrates the connection between maximizing the log probability of the tokens in the loss function and the next token prediction task.
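The factorization in equation 12.15 also shows why training sums log probabilities: the log of the product is the sum of the conditional log terms. A toy numerical check, with made-up conditional probabilities purely for illustration:

```python
import math

# Toy conditional distributions Pr(t_n | history) for a three-token "sentence";
# the numbers are invented purely to illustrate equation 12.15.
p = [0.2,   # Pr(t_1)
     0.5,   # Pr(t_2 | t_1)
     0.9]   # Pr(t_3 | t_1, t_2)

joint = math.prod(p)                             # Pr(t_1, t_2, t_3)
log_joint = sum(math.log(q) for q in p)          # the loss sums these log terms
print(math.isclose(math.log(joint), log_joint))  # True
```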
12.7.2 Masked self-attention
To train a decoder, we maximize the log probability of the input text under the autoregressive model. Ideally, we would pass in the whole sentence and compute all the log probabilities and gradients simultaneously. However, this poses a problem: if we pass in the full sentence, the term computing log[Pr(great|It takes)] has access to both the answer great and the right context courage to let yourself appear weak. Hence, the system can cheat rather than learn to predict the following words and will not train properly.
Fortunately, the tokens only interact in the self-attention layers in a transformer network. Hence, the problem can be resolved by ensuring that the attention to the answer and the right context is zero. This can be achieved by setting the corresponding dot products in the self-attention computation (equation 12.5) to negative infinity before they are passed through the softmax[•] function. This is known as masked self-attention. The effect is to make the weight of all the upward-angled arrows in figure 12.1 zero.
The entire decoder network operates as follows. The input text is tokenized, and the
tokens are converted to embeddings. The embeddings are passed into the transformer
network, but now the transformer layers use masked self-attention so that they can
only attend to the current and previous tokens. Each of the output embeddings can be
thought of as representing a partial sentence, and for each, the goal is to predict the next
token in the sequence. Consequently, after the transformer layers, a linear layer maps
each word embedding to the size of the vocabulary, followed by a softmax[•] function
that converts these values to probabilities. During training, we aim to maximize the sum
of the log probabilities of the next token in the ground truth sequence at every position
using a standard multiclass cross-entropy loss (figure 12.12).
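The training objective can be sketched numerically. Here, a hypothetical logits array stands in for the output of the final linear layer; the loss is the sum of negative log probabilities of the ground truth next tokens:

```python
import numpy as np

def next_token_loss(logits, targets):
    """Sum of negative log probabilities of the ground truth next tokens.

    logits:  N x V array; row n scores the token at position n+1.
    targets: length-N array of ground truth next-token indices."""
    # log-softmax, computed stably
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].sum()

rng = np.random.default_rng(1)
N, V = 4, 10                      # four positions, vocabulary of ten tokens
logits = rng.standard_normal((N, V))
targets = np.array([3, 1, 7, 2])  # ground truth next token at each position
loss = next_token_loss(logits, targets)
```

Minimizing this loss is equivalent to maximizing the sum of log probabilities of the observed sequence under the autoregressive model.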
12.7.3 Generating text from a decoder
The autoregressive language model is the first example of a generative model discussed
in this book. Since it defines a probability model over text sequences, it can be used
to sample new examples of plausible text. To generate from the model, we start with
an input sequence of text (which might be just a special <start> token indicating the
beginning of the sequence) and feed this into the network, which then outputs the proba-
bilities over possible subsequent tokens. We can then either pick the most likely token or
sample from this probability distribution. The new extended sequence can be fed back
into the decoder network that outputs the probability distribution over the next token.
By repeating this process, we can generate large bodies of text. The computation can
be made quite ecient as prior embeddings do not depend on subsequent ones due to
Draft: please send errata to udlbookmail@gmail.com.
Figure 12.12 Training GPT3-type decoder network. The tokens are mapped to
word embeddings with a special <start> token at the beginning of the sequence.
The embeddings are passed through a series of transformer layers that use masked
self-attention. Here, each position in the sentence can only attend to its own
embedding and those of tokens earlier in the sequence (orange connections). The
goal at each position is to maximize the probability of the following ground truth
token in the sequence. In other words, at position one, we want to maximize the
probability of the token It; at position two, we want to maximize the probability
of the token takes; and so on. Masked self-attention ensures the system cannot
cheat by looking at subsequent inputs. The autoregressive task has the advantage
of making ecient use of the data since every word contributes a term to the loss
function. However, it only exploits the left context of each word.
the masked self-attention. Hence, much of the earlier computation can be recycled as we
generate subsequent tokens (Problem 12.7).
In practice, many strategies can make the output text more coherent (see Notebook 12.4,
Decoding strategies). For example,
beam search keeps track of multiple possible sentence completions to find the overall most
likely (which is not necessarily found by greedily choosing the most likely next word at
each step). Top-k sampling randomly draws the next word from only the top-K most
likely possibilities to prevent the system from accidentally choosing from the long tail of
low-probability tokens and leading to an unnecessary linguistic dead end.
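These decoding strategies can be sketched as follows. The toy_model below is a made-up stand-in for the decoder network (a real model would run the transformer on the sequence so far); only the greedy and top-k logic reflect the text:

```python
import numpy as np

def greedy(probs):
    """Pick the single most likely next token."""
    return int(np.argmax(probs))

def top_k_sample(probs, k, rng):
    """Sample from the k most likely tokens only, renormalizing their
    probabilities; this avoids the long tail of unlikely tokens."""
    top = np.argsort(probs)[-k:]              # indices of the k largest
    p = probs[top] / probs[top].sum()
    return int(rng.choice(top, p=p))

def generate(step_fn, start, n_steps, choose):
    """Repeatedly feed the growing sequence back into the model.

    step_fn(seq) returns a probability distribution over the next token."""
    seq = list(start)
    for _ in range(n_steps):
        seq.append(choose(step_fn(seq)))
    return seq

# A toy "model": strongly prefers token (last token + 1) mod 5.
def toy_model(seq):
    probs = np.full(5, 0.05)
    probs[(seq[-1] + 1) % 5] = 0.8
    return probs

rng = np.random.default_rng(2)
greedy_seq = generate(toy_model, [0], 4, greedy)
sampled_seq = generate(toy_model, [0], 4, lambda p: top_k_sample(p, 2, rng))
```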
12.7.4 GPT3 and few-shot learning
Large language models like GPT3 apply these ideas on a massive scale. In GPT3, the
sequence lengths are 2048 tokens long, and the total batch size is 3.2 million tokens.
There are 96 transformer layers (some of which implement a sparse version of attention),
each processing a word embedding of size 12288. There are 96 heads in the self-attention
layers, and the value, query, and key dimension is 128. It is trained with 300 billion
tokens and contains 175 billion parameters.
This work is subject to a Creative Commons CC-BY-NC-ND license. (C) MIT Press.
Here’s an example of completing text from the GPT3 model where the text provided
to the model is in orange, and the generated text is in cyan:
Understanding Deep Learning is a new textbook from MIT Press by Simon Prince that’s de-
signed to offer an accessible, broad introduction to the field. Deep learning is a branch of
machine learning that is concerned with algorithms that learn from data that is unstructured
or unlabeled. The book is divided into four sections:
1. Introduction to deep learning
2. Deep learning architecture
3. Deep learning algorithms
4. Applications of deep learning
The rst section oers an introduction to deep learning, including its history and origins.
The second section covers deep learning architecture, discussing various types of neural net-
works and their applications. The third section dives into deep learning algorithms, including
supervised and unsupervised learning, reinforcement learning, and more. The fourth section
applies deep learning to various domains, such as computer vision, natural language processing,
and robotics.
This continuation is plausible, albeit only partially accurate.
One surprising property of language models at this scale is that they can perform many
tasks without fine-tuning. If we provide several examples of correct question/answer pairs
and then another question, they often answer the final question correctly by completing
the sequence. One example is correcting English grammar:
Poor English input: I eated the purple berries.
Good English output: I ate the purple berries.
Poor English input: Thank you for picking me as your designer. I’d appreciate it.
Good English output: Thank you for choosing me as your designer. I appreciate it.
Poor English input: The mentioned changes have done. or I did the alteration that you
requested. or I changed things you wanted and did the modifications.
Good English output: The requested changes have been made. or I made the alteration that
you requested. or I changed things you wanted and made the modifications.
Poor English input: I’d be more than happy to work with you in another project.
Good English output: I’d be more than happy to work with you on another project.
(result from Brown et al., 2020)
Here, the text containing the paired examples in orange was provided as context for
GPT3, and the system then generated the correct answer in cyan. This phenomenon
extends to many situations, including generating code snippets based on natural language
descriptions, arithmetic, translating between languages, and answering questions about
text passages. Consequently, it is argued that enormous language models are few-shot
learners; they can learn to do novel tasks based on just a few examples. However,
performance is erratic in practice, and the extent to which it is extrapolating from
learned examples rather than merely interpolating or copying verbatim is unclear.
Figure 12.13 Encoder-decoder architecture. Two sentences are passed to the
system with the goal of translating the first into the second. a) The first sentence
is passed through a standard encoder. b) The second sentence is passed through a
decoder that uses masked self-attention but also attends to the output embeddings
of the encoder using cross-attention (orange rectangle). The loss function is the
same as for the decoder model; we want to maximize the probability of the next
word in the output sequence.
12.8 Encoder-decoder model example: machine translation
Translation between languages is an example of a sequence-to-sequence task. This re-
quires an encoder (to compute a good representation of the source sentence) and a
decoder (to generate the sentence in the target language). This task can be tackled
using an encoder-decoder model.
Consider translating from English to French. The encoder receives the sentence
in English and processes it through a series of transformer layers to create an output
representation for each token. During training, the decoder receives the ground truth
translation in French and passes it through a series of transformer layers that use masked
self-attention and predict the following word at each position. However, the decoder
layers also attend to the output of the encoder. Consequently, each French output word is
Figure 12.14 Cross-attention. The flow of computation is the same as in stan-
dard self-attention. However, the queries are calculated from the decoder embed-
dings X_dec, and the keys and values from the encoder embeddings X_enc. In the
context of translation, the encoder contains information about the source lan-
guage, and the decoder contains information about the target language statistics.
conditioned on the previous output words and the source English sentence (figure 12.13).
This is achieved by modifying the transformer layers in the decoder. Originally,
these consisted of a masked self-attention layer followed by a neural network applied
individually to each embedding (figure 12.12). A new self-attention layer is added be-
tween these two components, in which the decoder embeddings attend to the encoder
embeddings. This uses a version of self-attention known as encoder-decoder attention or
cross-attention, where the queries are computed from the decoder embeddings and the
keys and values from the encoder embeddings (figure 12.14).
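A minimal sketch of cross-attention follows (illustrative names and dimensions; the real layers also use multiple heads, biases, and layer normalization):

```python
import numpy as np

def cross_attention(X_dec, X_enc, Wq, Wk, Wv):
    """Cross-attention: queries come from the decoder embeddings,
    keys and values from the encoder embeddings.

    X_dec: M x D decoder embeddings.  X_enc: N x D encoder embeddings."""
    Q = X_dec @ Wq                        # M x D queries from the decoder
    K, V = X_enc @ Wk, X_enc @ Wv         # N x D keys/values from the encoder
    scores = Q @ K.T / np.sqrt(X_dec.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V                    # M x D: one output per decoder token

rng = np.random.default_rng(3)
M, N, D = 4, 6, 8                         # 4 target tokens, 6 source tokens
X_dec = rng.standard_normal((M, D))
X_enc = rng.standard_normal((N, D))
Wq, Wk, Wv = (rng.standard_normal((D, D)) for _ in range(3))
out = cross_attention(X_dec, X_enc, Wq, Wk, Wv)
```

Note that no mask is needed here: every decoder position may attend to every encoder position, since the full source sentence is always available.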
12.9 Transformers for long sequences
Since each token in a transformer encoder model interacts with every other token, the
computational complexity scales quadratically with the length of the sequence. For a
decoder model, each token only interacts with previous tokens, so there are roughly
half the number of interactions, but the complexity still scales quadratically. These
relationships can be visualized as interaction matrices (figure 12.15a–b).
This quadratic increase in the amount of computation ultimately limits the length of
sequences that can be used. Many methods have been developed to extend the trans-
Figure 12.15 Interaction matrices for self-attention. a) In an encoder, every token
interacts with every other token, and computation expands quadratically with the
number of tokens. b) In a decoder, each token only interacts with the previous
tokens, but complexity is still quadratic. c) Complexity can be reduced by using
a convolutional structure (encoder case). d) Convolutional structure for decoder
case. e–f) Convolutional structure with dilation rate of two and three (decoder
case). g) Another strategy is to allow selected tokens to interact with all the
other tokens (encoder case) or all the previous tokens (decoder case pictured).
h) Alternatively, global tokens can be introduced (left two columns and top two
rows). These interact with all of the tokens as well as with each other.
former to cope with longer sequences. One approach is to prune the self-attention in-
teractions or, equivalently, to sparsify the interaction matrix (figures 12.15c–h). For
example, this can be restricted to a convolutional structure so that each token only in-
teracts with a few neighboring tokens. Across multiple layers, tokens still interact at
larger distances as the receptive field expands. As for convolution in images, the kernel
can vary in size and dilation rate.
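The sparse interaction patterns in figure 12.15 can be expressed as Boolean masks over token pairs. The sketch below constructs a causal convolutional mask for a given kernel size and dilation rate (names and conventions are illustrative):

```python
import numpy as np

def conv_mask(n, kernel, dilation=1, causal=True):
    """Interaction matrix for convolutional self-attention.

    Token i may attend to token j only if the offset i - j is a multiple of
    the dilation rate and within the kernel half-width; in the causal
    (decoder) case, only positions j <= i are allowed."""
    idx = np.arange(n)
    diff = idx[:, None] - idx[None, :]        # i - j for every token pair
    half = (kernel - 1) // 2 * dilation
    mask = (np.abs(diff) <= half) & (np.abs(diff) % dilation == 0)
    if causal:
        mask &= diff >= 0
    return mask

dense = np.ones((6, 6), dtype=bool)           # encoder: all pairs interact
sparse = conv_mask(6, kernel=3, dilation=2, causal=True)
```

With kernel size 3 and dilation 2, each token attends only to itself and the token two positions back, so the number of interactions grows linearly rather than quadratically with sequence length.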
A pure convolutional approach requires many layers to integrate information over
large distances. One way to speed up this process is to allow select tokens (perhaps at
the start of every sentence) to attend to all other tokens (encoder model) or all previous
tokens (decoder model). A similar idea is to have a small number of global tokens that
connect to all the other tokens and themselves. Like the <cls> token, these do not
represent any word but serve to provide long-distance connections.
12.10 Transformers for images
Transformers were initially developed for text data. Their enormous success in this area
led to experimentation on images. This was not obviously a promising idea for two
reasons. First, there are many more pixels in an image than words in a sentence, so the
quadratic complexity of self-attention poses a practical bottleneck. Second, convolutional
nets have a good inductive bias because each layer is equivariant to spatial translation,
and they take into account the 2D structure of the image. However, this must be learned
in a transformer network.
Regardless of these apparent disadvantages, transformer networks for images have
now eclipsed the performance of convolutional networks for image classification and other
tasks. This is partly because of the enormous scale at which they can be constructed
and the large amounts of data that can be used to pre-train the networks. This section
describes transformer models for images.
12.10.1 ImageGPT
ImageGPT is a transformer decoder; it builds an autoregressive model of image pixels
that ingests a partial image and predicts the subsequent pixel value. The quadratic
complexity of the transformer network means that the largest model (which contained
6.8 billion parameters) could still only operate on 64×64 images. Moreover, to make this
tractable, the original 24-bit RGB color space had to be quantized into a nine-bit color
space, so the system ingests (and predicts) one of 512 possible tokens at each position.
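As an illustration of a nine-bit color space, the sketch below keeps the top three bits of each RGB channel to produce one of 512 tokens per pixel. This bit-truncation scheme is a simplification for illustration: ImageGPT itself builds its 512-color palette by clustering RGB values.

```python
import numpy as np

def quantize_rgb(image):
    """Map 24-bit RGB pixels to one of 512 tokens by keeping the top three
    bits of each channel (3 + 3 + 3 = 9 bits).

    image: H x W x 3 array of uint8 values."""
    bits = image.astype(np.int64) >> 5         # 8 bits -> top 3 bits (0..7)
    return (bits[..., 0] << 6) | (bits[..., 1] << 3) | bits[..., 2]

rng = np.random.default_rng(4)
image = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
tokens = quantize_rgb(image)                   # one token per pixel
```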
Images are naturally 2D objects, but ImageGPT simply learns a different positional
encoding at each pixel. Hence it must learn that each pixel has a close relationship with
its preceding neighbors and also with nearby pixels in the row above. Figure 12.16 shows
example generation results.
The internal representation of this decoder was used as a basis for image classification.
The final pixel embeddings are averaged, and a linear layer maps these to activations
which are passed through a softmax layer to predict class probabilities. The system is
pre-trained on a large corpus of web images and then fine-tuned on the ImageNet database
resized to 48×48 pixels using a loss function that contains both a cross-entropy term for
image classification and a generative loss term for predicting the pixels. Despite using a
large amount of external training data, the system achieved only a 27.4% top-1 error rate
on ImageNet (figure 10.15). This was worse than convolutional architectures of the time
(see figure 10.21) but is still impressive given the small input image size; unsurprisingly,
it fails to classify images where the target object is small or thin.
12.10.2 Vision Transformer (ViT)
The Vision Transformer tackled the problem of image resolution by dividing the image
into 16×16 patches (figure 12.17). Each patch is mapped to a lower dimension via a
learned linear transformation (Problem 12.8), and these representations are fed into the
transformer network. Once again, standard 1D positional encodings are learned.
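The patch embedding step can be sketched as follows (a hypothetical random projection matrix stands in for the learned linear transformation):

```python
import numpy as np

def patch_embeddings(image, patch, W):
    """Split an image into non-overlapping patch x patch blocks, flatten
    each block, and project it to an embedding with a learned matrix W.

    image: H x W x 3 array.  W: (patch*patch*3) x D projection matrix."""
    H, Wd, C = image.shape
    rows, cols = H // patch, Wd // patch
    patches = (image[:rows * patch, :cols * patch]
               .reshape(rows, patch, cols, patch, C)
               .transpose(0, 2, 1, 3, 4)       # group the pixels of each patch
               .reshape(rows * cols, patch * patch * C))
    return patches @ W                         # one embedding per patch

rng = np.random.default_rng(5)
image = rng.standard_normal((224, 224, 3))
P, D = 16, 64                                  # 16x16 patches, D-dim embeddings
W = rng.standard_normal((P * P * 3, D))        # stands in for the learned map
emb = patch_embeddings(image, P, W)            # 14 x 14 = 196 patch embeddings
```

A 224×224 image thus becomes a sequence of 196 embeddings, which is short enough for standard self-attention.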
This is an encoder model with a <cls> token (see figures 12.10–12.11). However,
unlike BERT, it uses supervised pre-training on a large database of 303 million labeled
images from 18,000 classes. The <cls> token is mapped via a final network layer to
create activations that are fed into a softmax function to generate class probabilities.
After pre-training, the system is applied to the final classification task by replacing this
Figure 12.16 ImageGPT. a) Images generated from the autoregressive ImageGPT
model. The top-left pixel is drawn from the estimated empirical distribution at
this position. Subsequent pixels are generated in turn, conditioned on the previous
ones, working along the rows until the bottom-right of the image is reached. For
each pixel, the transformer decoder generates a conditional distribution as in
equation 12.15, and a sample is drawn. The extended sequence is then fed back
into the network to generate the next pixel, and so on. b) Image completion.
In each case, the lower half of the image is removed (top row), and ImageGPT
completes the remaining part pixel by pixel (three different completions shown).
Adapted from https://openai.com/blog/image-gpt/.
final layer with one that maps to the desired number of classes and is fine-tuned.
For the ImageNet benchmark, this system achieved an 11.45% top-1 error rate. How-
ever, it did not perform as well as the best contemporary convolutional networks without
supervised pre-training. The strong inductive bias of convolutional networks can only
be superseded by employing extremely large amounts of training data.
12.10.3 Multi-scale vision transformers
The Vision Transformer diers from convolutional architectures in that it operates on
a single scale. Several transformer models that process the image at multiple scales
have been proposed. Similarly to convolutional networks, these generally start with
high-resolution patches and few channels and gradually decrease the resolution while
simultaneously increasing the number of channels.
Figure 12.17 Vision transformer. The Vision Transformer (ViT) breaks the image
into a grid of patches (16×16 in the original implementation). Each of these
is projected via a learned linear transformation to become a patch embedding.
These patch embeddings are fed into a transformer encoder network, and the
<cls> token is used to predict the class probabilities.
Figure 12.18 Shifted window (SWin) transformer (Liu et al., 2021c). a) Origi-
nal image. b) The SWin transformer breaks the image into a grid of windows
and each of these windows into a sub-grid of patches. The transformer network
applies self-attention to the patches within each window independently. c) Each
alternate layer shifts the windows so that the subsets of patches that interact
with one another change, and information can propagate across the whole image.
d) After several layers, the 2×2 blocks of patch representations are concatenated
to increase the effective patch (and window) size. e) Alternate layers use shifted
windows at this new lower resolution. f) Eventually, the resolution is such that
there is just a single window, and the patches span the entire image.
A representative example of a multi-scale transformer is the shifted-window or SWin
transformer. This is an encoder transformer that divides the image into patches and
groups these patches into a grid of windows within which self-attention is applied in-
dependently (figure 12.18). These windows are shifted in adjacent transformer layers, so
the effective receptive field at a given patch can expand beyond the window border.
The scale is reduced periodically by concatenating features from non-overlapping 2×2
patches and applying a linear transformation that maps these concatenated features to
twice the original number of channels. This architecture does not have a <cls> token
but instead averages the output features at the last layer. These are then mapped via a
linear layer to the desired number of classes and passed through a softmax function to
output class probabilities. At the time of writing, the most sophisticated version of this
architecture achieves a 9.89% top-1 error rate on the ImageNet database.
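The patch-merging step can be sketched as follows (illustrative shapes and names; the real model interleaves this operation with windowed self-attention layers):

```python
import numpy as np

def merge_patches(features, W):
    """SWin-style downsampling: concatenate the features of each
    non-overlapping 2x2 block of patches and project the 4C-dimensional
    result to 2C channels with a linear map W.

    features: H x W x C grid of patch features.  W: 4C x 2C matrix."""
    H, Wd, C = features.shape
    blocks = (features.reshape(H // 2, 2, Wd // 2, 2, C)
              .transpose(0, 2, 1, 3, 4)        # gather each 2x2 block
              .reshape(H // 2, Wd // 2, 4 * C))
    return blocks @ W                          # halve resolution, double channels

rng = np.random.default_rng(6)
C = 8
features = rng.standard_normal((8, 8, C))      # 8x8 grid of patch features
W = rng.standard_normal((4 * C, 2 * C))        # stands in for the learned map
merged = merge_patches(features, W)            # 4x4 grid with 2C channels
```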
A related idea is periodically to integrate information from across the whole image.
Dual attention vision transformers (DaViT) alternate two types of transformers. In the
first, image patches attend to one another, and the self-attention computation uses all
the channels. In the second, the channels attend to one another, and the self-attention
computation uses all the image patches. This architecture reaches a 9.60% top-1 error
rate on ImageNet and is close to the state-of-the-art at the time of writing (Problem 12.9).
12.11 Summary
This chapter introduced self-attention and the transformer architecture. Encoder, de-
coder, and encoder-decoder models were then described. The transformer operates on
sets of high-dimensional embeddings. It has a low computational complexity per layer,
and much of the computation can be performed in parallel using the matrix form. Since
every input embedding interacts with every other, it can describe long-range dependen-
cies in text. Ultimately, the computation scales quadratically with the sequence length;
one approach to reducing the complexity is sparsifying the interaction matrix.
The training of transformers with very large unlabeled datasets is the first example
of unsupervised learning (learning without labels) in this book. Encoders learn a repre-
sentation that can be used for other tasks by predicting missing tokens. Decoders build
an autoregressive model over the inputs and are the first example of a generative model
in this book. The generative decoders can be used to create new data examples.
Chapter 13 considers networks for processing graph data. These have connections
with transformers in that the nodes of the graph attend to one another in each network
layer. Chapters 14–18 return to unsupervised learning and generative models.
Notes
Natural language processing: Transformers were developed for natural language processing
(NLP) tasks. This is an enormous area that deals with text analysis, categorization, generation,
and manipulation. Example tasks include part-of-speech tagging, translation, text classification,
entity recognition (people, places, companies, etc.), text summarization, question answering,
Figure 12.19 Recurrent neural networks (RNNs). The word embeddings are
passed sequentially through a series of identical neural networks. Each network
has two outputs; one is the output embedding, and the other (orange arrows)
feeds back into the next neural network, along with the next word embedding.
Each output embedding contains information about the word itself and its con-
text in the preceding sentence fragment. In principle, the final output contains
information about the entire sentence and could be used to support classification
tasks similarly to the <cls> token in a transformer encoder model. However,
RNNs sometimes gradually “forget” about tokens that are further back in time.
word sense disambiguation, and document clustering. NLP was initially tackled by rule-based
methods that exploited the structure and statistics of grammar. See Manning & Schütze (1999)
and Jurafsky & Martin (2000) for early approaches.
Recurrent neural networks: Before the introduction of transformers, many state-of-the-art
NLP applications used recurrent neural networks, or RNNs for short (figure 12.19). The term
“recurrent” was introduced by Rumelhart et al. (1985), but the main idea dates to at least
Minsky & Papert (1969). RNNs ingest a sequence of inputs (words in NLP) one at a time.
At each step, the network receives both the new input and a hidden representation computed
from the previous time step (the recurrent connection). The nal output contains information
about the whole input. This representation can then support NLP tasks like classication or
translation. They have also been used in a decoding context in which generated tokens are
fed back into the model to form the next input to the sequence. For example, the PixelRNN
(Van den Oord et al., 2016c) used RNNs to build an autoregressive model of images.
From RNNs to transformers: One of the problems with RNNs is that they can forget in-
formation that is further back in the sequence. More sophisticated versions of this architecture,
such as long short-term memory networks or LSTMs (Hochreiter & Schmidhuber, 1997b) and
gated recurrent units or GRUs (Cho et al., 2014; Chung et al., 2014) partially addressed this
problem. However, in machine translation, the idea emerged that all of the intermediate rep-
resentations in the RNN could be exploited to produce the output sentence. Moreover, certain
output words should attend more to certain input words according to their relation (Bahdanau
et al., 2015). This ultimately led to dispensing with the recurrent structure and replacing it with
the encoder-decoder transformer (Vaswani et al., 2017). Here input tokens attend to one another
(self-attention), output tokens attend to those earlier in the sequence (masked self-attention),
and output tokens also attend to the input tokens (cross-attention). A formal algorithmic de-
scription of the transformer can be found in Phuong & Hutter (2022), and a survey of work can
be found in Lin et al. (2022). The literature should be approached with caution, as many en-
hancements to transformers do not make meaningful performance improvements when carefully
assessed in controlled experiments (Narang et al., 2021).
Applications: Models based on self-attention and/or the transformer architecture have been
applied to text sequences (Vaswani et al., 2017), image patches (Dosovitskiy et al., 2021),
protein sequences (Rives et al., 2021), graphs (Veličković et al., 2019), database schema (Xu
et al., 2021b), speech (Wang et al., 2020c), mathematical integration when formulated as a
translation problem (Lample & Charton, 2020), and time series (Wu et al., 2020b). However,
their most celebrated successes have been in building language models and, more recently, as a
replacement for convolutional networks in computer vision.
Large language models: Vaswani et al. (2017) targeted translation tasks, but transformers
are now more usually used to build either pure encoder or pure decoder models, the most famous
of which are BERT (Devlin et al., 2019) and GPT2/GPT3 (Radford et al., 2019; Brown et al.,
2020), respectively. These models are usually tested against benchmarks like GLUE (Wang
et al., 2019b), which includes the SQuAD question-answering task (Rajpurkar et al., 2016)
described in section 12.6.2, SuperGLUE (Wang et al., 2019a) and BIG-bench (Srivastava et al.,
2022), which combine many NLP tasks to create an aggregate score for measuring language
ability. Decoder models are generally not fine-tuned for these tasks but can perform well anyway
when given a few examples of questions and answers and asked to complete the text from the
next question. This is referred to as few-shot learning (Brown et al., 2020).
Since GPT3, many decoder language models have been released with steady improvement in
few-shot results. These include GLaM (Du et al., 2022), Gopher (Rae et al., 2021), Chinchilla
(Homann et al., 2023), Megatron-Turing NLG (Smith et al., 2022), and LaMDa (Thoppilan
et al., 2022). Most of the performance improvement is attributable to increased model size,
using sparsely activated modules, and exploiting larger datasets. At the time of writing, the
most recent model is PaLM (Chowdhery et al., 2022), which has 540 billion parameters and
was trained on 780 billion tokens across 6144 processors. Interestingly, since text is highly
compressible, this model has more than enough capacity to memorize the entire training dataset.
This is true for many language models. Many bold statements have been made about how large
language models exceed human performance. This is probably true for some tasks, but such
statements should be treated with caution (see Ribeiro et al., 2021; McCoy et al., 2019; Bowman
& Dahl, 2021; and Dehghani et al., 2021).
These models have considerable world knowledge. For example, in section 12.7.4, the model
knows key facts about deep learning, including that it is a type of machine learning with
associated algorithms and applications. Indeed, one such model has been mistakenly identified
as being sentient (Clark, 2022). However, there are persuasive arguments that the degree of
“understanding” this type of model can ever have is limited (Bender & Koller, 2020).
Tokenizers: Schuster & Nakajima (2012) and Sennrich et al. (2015) introduced WordPiece
and byte pair encoding (BPE), respectively. Both methods greedily merge pairs of tokens based
on their frequency of adjacency (figure 12.8), with the main difference being how the initial
tokens are chosen. For example, in BPE, the initial tokens are characters or punctuation with
a special token to denote whitespace. The merges cannot occur over the whitespace. As the
algorithm proceeds, new tokens are formed by combining characters recursively so that sub-
word and word tokens emerge. The unigram language model (Kudo, 2018) generates several
possible candidate merges and chooses the best one based on the likelihood in a language model.
Provilkov et al. (2020) develop BPE dropout, which generates the candidates more efficiently
by introducing randomness into the process of counting frequencies. Versions of both byte pair
encoding and the unigram language model are included in the SentencePiece library (Kudo &
Richardson, 2018), which works directly on Unicode characters and can work with any language.
He et al. (2020) introduce a method that treats the sub-word segmentation as a latent variable
that should be marginalized out for learning and inference.
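The greedy merging at the heart of byte pair encoding can be sketched in a few lines. This toy version operates on character lists and ignores many practical details (byte-level tokens, the whitespace marker, frequency-weighted corpora):

```python
from collections import Counter

def bpe_merges(words, n_merges):
    """Greedy byte pair encoding (sketch): repeatedly merge the most
    frequent adjacent token pair.  Words are token lists; merges never
    cross word boundaries (a stand-in for the whitespace rule)."""
    words = [list(w) for w in words]
    merges = []
    for _ in range(n_merges):
        pairs = Counter(p for w in words for p in zip(w, w[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]  # most frequent adjacent pair
        merges.append(a + b)
        new_words = []
        for w in words:                           # apply the merge everywhere
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and w[i] == a and w[i + 1] == b:
                    out.append(a + b)
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_words.append(out)
        words = new_words
    return merges, words

merges, tokenized = bpe_merges(["hug", "pug", "hugs"], n_merges=2)
```

On this tiny corpus, the pair (u, g) is most frequent and merges first, after which (h, ug) merges, so sub-word tokens like "ug" and whole-word tokens like "hug" emerge naturally.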
Decoding algorithms: Transformer decoder models take a body of text and return a prob-
ability over the next token. This is then added to the preceding text, and the model is run
again. The process of choosing tokens from these probability distributions is known as decoding.
Naïve ways to do this would be to either (i) greedily choose the most likely token or (ii) choose
a token randomly according to the distribution. However, neither of these methods works well
in practice. In the former case, the results may be very generic, and the latter case may lead
to degraded quality outputs (Holtzman et al., 2020). This is partly because, during training,
the model was only exposed to sequences of ground truth tokens (known as teacher forcing) but
sees its own output when deployed.
It is not computationally feasible to try every combination of tokens in the output sequence,
but it is possible to maintain a fixed number of parallel hypotheses and choose the most likely
overall sequence. This is known as beam search. Beam search tends to produce many similar
hypotheses and has been modied to investigate more diverse sequences (Vijayakumar et al.,
2016; Kulikov et al., 2018). One possible problem with random sampling is that there is a very
long tail of unlikely following words that collectively have a significant probability. This has
led to the development of top-K sampling, in which tokens are sampled from only the K most
likely hypotheses (Fan et al., 2018). Top-K sampling still sometimes allows unreasonable token
choices when there are only a few high-probability choices. To resolve this problem, Holtzman
et al. (2020) proposed nucleus sampling, in which tokens are sampled from a fixed proportion of
the total probability mass. El Asri & Prince (2020) discuss decoding algorithms in more depth.
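Nucleus sampling can be sketched as follows (names and the example distribution are illustrative):

```python
import numpy as np

def nucleus_sample(probs, p, rng):
    """Nucleus (top-p) sampling: keep the smallest set of most likely
    tokens whose total probability reaches p, renormalize, and sample."""
    order = np.argsort(probs)[::-1]            # most likely tokens first
    cum = np.cumsum(probs[order])
    keep = order[:int(np.searchsorted(cum, p)) + 1]
    q = probs[keep] / probs[keep].sum()        # renormalize the nucleus
    return int(rng.choice(keep, p=q))

rng = np.random.default_rng(7)
probs = np.array([0.5, 0.3, 0.1, 0.05, 0.05])
# With p = 0.7, only the two most likely tokens form the nucleus.
samples = [nucleus_sample(probs, p=0.7, rng=rng) for _ in range(200)]
```

Unlike top-k, the number of candidate tokens adapts to the distribution: when the model is confident, the nucleus is small; when it is uncertain, more tokens are eligible.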
Types of attention: Scaled dot-product attention (Vaswani et al., 2017) is just one of a
family of attention mechanisms that includes additive attention (Bahdanau et al., 2015), multi-
plicative attention (Luong et al., 2015), key-value attention (Daniluk et al., 2017), and memory-
compressed attention (Liu et al., 2019c). Zhai et al. (2021) constructed “attention-free” trans-
formers, in which the tokens interact in a way that does not have quadratic complexity. Multi-
head attention was also introduced by Vaswani et al. (2017). Interestingly, it appears that most
of the heads can be pruned after training without critically affecting the performance (Voita
et al., 2019); it has been suggested that their role is to guard against bad initializations. Hu et al.
(2018b) propose squeeze-and-excitation networks, attention-like mechanisms that re-weight the
channels in a convolutional layer based on globally computed features.
Relationship of self-attention to other models: The self-attention computation has close
connections to other models. First, it is an example of a hypernetwork (Ha et al., 2017) in that
it uses one part of the network to choose the weights of another part: the attention matrix forms
the weights of a sparse network layer that maps the values to the outputs (figure 12.3). The
synthesizer (Tay et al., 2021) simplifies this idea by simply using a neural network to create each
row of the attention matrix from the corresponding input. Even though the input tokens no
longer interact with each other to create the attention weights, this works surprisingly well. Wu
et al. (2019) present a similar system that produces an attention matrix with a convolutional
structure so the tokens attend to their neighbors. The gated multi-layer perceptron (Wu et al.,
2019) computes a matrix that pointwise multiplies the values and hence modies them without
mixing them. Transformers are also closely related to fast weight memory systems, which were
the intellectual forerunners of hypernetworks (Schlag et al., 2021).
Self-attention can also be thought of as a routing mechanism (figure 12.1), and from this view-
point, there is a connection to capsule networks (Sabour et al., 2017). These capture hierarchical
relations in images; lower network levels might detect facial parts (noses, mouths), which are
then combined (routed) in higher-level capsules that represent a face. However, capsule net-
works use routing by agreement. In self-attention, the inputs compete with each other for how
much they contribute to a given output (via the softmax operation). In capsule networks, the
outputs of the layer compete with each other for inputs from earlier layers. Once we consider
self-attention as a routing network, we can question whether making this routing dynamic (i.e.,
dependent on the data) is necessary. The random synthesizer (Tay et al., 2021) removed the de-
pendence of the attention matrix on the inputs entirely and either used predetermined random
values or learned values. This performed surprisingly well across a variety of tasks.
Multi-head self-attention also has close connections to graph neural networks (see chapter 13),
convolution (Cordonnier et al., 2020), recurrent neural networks (Choromanski et al., 2020),
and memory retrieval in Hopeld networks (Ramsauer et al., 2021). For more information on
the relationships between transformers and other models, consult Prince (2021a).
Positional encoding: The original transformer paper (Vaswani et al., 2017) experimented
both with predefining the positional encoding matrix Π and with learning it.
It might seem odd to add the positional encodings to the D × N data matrix X rather than
concatenate them. However, the data dimension D is usually greater than the number of
tokens N , so the positional encoding lies in a subspace. The word embeddings in X are learned,
so the system can theoretically keep the two components in orthogonal subspaces and retrieve
the positional encodings as required. The predefined embeddings chosen by Vaswani et al.
(2017) were a family of sinusoidal components with two attractive properties: (i) the relative
position of two embeddings is easy to recover using a linear operation and (ii) their dot product
generally decreased as the distance between positions increased (see Prince, 2021a, for more
details). Many systems, such as GPT3 and BERT, learn positional encodings. Wang et al.
(2020a) examined the cosine similarities of the positional encodings in these models and showed
that they generally decline with relative distance, although they also have a periodic component.
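The sinusoidal scheme can be sketched directly. Below is a minimal numpy version assuming the common convention of alternating sine/cosine rows with base 10000 and an even embedding dimension D; details such as the exact row ordering vary between implementations:

```python
import numpy as np

def positional_encoding(D, N):
    """D x N matrix of sinusoidal positional encodings in the style of
    Vaswani et al. (2017); column n encodes position n.  Assumes D is even.
    Rows alternate sine and cosine at geometrically spaced frequencies."""
    positions = np.arange(N)
    rates = 10000.0 ** (-np.arange(0, D, 2) / D)  # one frequency per sin/cos pair
    angles = np.outer(rates, positions)           # shape (D/2, N)
    Pi = np.empty((D, N))
    Pi[0::2, :] = np.sin(angles)
    Pi[1::2, :] = np.cos(angles)
    return Pi
```

The dot product between a column and its neighbors generally decreases with distance, which is the second property mentioned above.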
Much subsequent work has modied just the attention matrix so that in the scaled dot-product
self-attention equation:
Sa[X] = V · Softmax
"
K
T
Q
D
q
#
, (12.16)
only the queries and keys contain position information:

V = β_v1ᵀ + Ω_vX
Q = β_q1ᵀ + Ω_q(X + Π)
K = β_k1ᵀ + Ω_k(X + Π).    (12.17)
This has led to the idea of multiplying out the quadratic component in the numerator of equa-
tion 12.16 and retaining only some of the terms. For example, Ke et al. (2021) decouple or untie
the content and position information by retaining only the content-content and position-position
terms and using dierent projection matrices
for each.
Another modication is to inject information directly about the relative position. This is more
important than absolute position since a batch of text can start at an arbitrary place in a
document. Shaw et al. (2018), Raffel et al. (2020), and Huang et al. (2020b) all developed
systems where a single term was learned for each relative position offset, and the attention
matrix was modified in various ways using these relative positional encodings. Wei et al. (2019)
investigated relative positional encodings based on predefined sinusoidal embeddings rather than
learned values. DeBERTa (He et al., 2021) combines these ideas; they retain only a subset of
terms from the quadratic expansion, apply different projection matrices to them, and use relative
positional encodings. Other work has explored sinusoidal embeddings that encode absolute and
relative position information in more complex ways (Su et al., 2021).
Wang et al. (2020a) compare the performance of transformers in BERT with different posi-
tional encodings. They found that relative positional encodings perform better than absolute
positional encodings, but there was little difference between using sinusoidal and learned em-
beddings. A survey of positional encodings can be found in Dufter et al. (2021).
Extending transformers to longer sequences: The complexity of the self-attention mech-
anism increases quadratically with the sequence length. Some tasks like summarization or
question answering may require long inputs, so this quadratic dependence limits performance.
Three lines of work have attempted to address this problem. The first decreases the size of the
attention matrix, the second makes the attention sparse, and the third modifies the attention
mechanism to make it more efficient.
To decrease the size of the attention matrix, Liu et al. (2018b) introduced memory-compressed
attention. This applies strided convolution to the keys and values, which reduces the number
of positions in a very similar way to downsampling in a convolutional network. Attention is
now applied between weighted combinations of neighboring positions, where the weights are
learned. Along similar lines, Wang et al. (2020b) observed that the quantities in the attention
mechanism are often low rank in practice and developed the Linformer, which projects the keys
and values onto a smaller subspace before computing the attention matrix.
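A minimal sketch of this kind of low-rank projection follows. This is not the authors' code: the projection matrices here are random stand-ins for the learned ones, and the notation follows the book's D × N convention:

```python
import numpy as np

def softmax_cols(Z):
    """Softmax applied independently to each column."""
    Z = Z - Z.max(axis=0, keepdims=True)
    expZ = np.exp(Z)
    return expZ / expZ.sum(axis=0, keepdims=True)

def low_rank_attention(Q, K, V, E, F):
    """Linformer-style attention sketch (after Wang et al., 2020b).
    Q, K, V are D x N; E and F are N x k projections that compress the
    keys and values from N positions to k, so the attention matrix is
    k x N rather than N x N (cost O(NkD) rather than O(N^2 D))."""
    D = Q.shape[0]
    A = softmax_cols((K @ E).T @ Q / np.sqrt(D))  # k x N attention matrix
    return (V @ F) @ A                            # D x N output

# Illustrative usage with random data.
rng = np.random.default_rng(1)
D, N, k = 8, 32, 4
Q, K, V = (rng.standard_normal((D, N)) for _ in range(3))
E, F = (rng.standard_normal((N, k)) / np.sqrt(N) for _ in range(2))
out = low_rank_attention(Q, K, V, E, F)           # shape (D, N)
```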
To make attention sparse, Liu et al. (2018b) proposed local attention, in which neighboring
blocks of tokens only attend to one another. This creates a block-diagonal interaction matrix (see
figure 12.15). Information cannot pass from block to block, so such layers are typically alternated
with full attention. Along the same lines, GPT3 (Brown et al., 2020) uses a convolutional
interaction matrix and alternates this with full attention. Child et al. (2019) and Beltagy et al.
(2020) experimented with various interaction matrices, including convolutional structures with
dierent dilation rates but allowing some queries to interact with every other key. Ainslie
et al. (2020) introduced the extended transformer construction (figure 12.15h), which uses a
set of global embeddings that interact with every other token. This can only be done in the
encoder version, or these implicitly allow the system to “look ahead.” When combined with
relative position encoding, this scheme requires special encodings for mapping to, from, and
between these global embeddings. BigBird (Ainslie et al., 2020) combined global embeddings
and a convolutional structure with a random sampling of possible connections. Other work
has investigated learning the sparsity pattern of the attention matrix (Roy et al., 2021; Kitaev
et al., 2020; Tay et al., 2020).
Finally, it has been noted that the terms in the numerator and denominator of the softmax oper-
ation that computes attention have the form exp[kᵀq]. This can be treated as a kernel function
and, as such, can be expressed as the dot product g[k]ᵀg[q], where g[•] is a nonlinear transforma-
tion (problem 12.10). This formulation decouples the queries and keys, making the attention computation more
efficient. Unfortunately, to replicate the form of the exponential terms, the transformation g[•]
must map the inputs to an infinite-dimensional space. The linear transformer (Katharopoulos et al., 2020)
recognizes this and replaces the exponential term with a different similarity measure. The Per-
former (Choromanski et al., 2020) approximates this infinite mapping with a finite-dimensional
one. More details about extending transformers to longer sequences can be found in Tay et al.
(2023) and Prince (2021a).
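The efficiency gain from this factorization is easy to verify numerically. Below is a minimal sketch of linearized attention using the elu(x) + 1 feature map of Katharopoulos et al. (2020); the function names are our own:

```python
import numpy as np

def g(x):
    """Positive feature map elu(x) + 1, used by the linear transformer
    (Katharopoulos et al., 2020) in place of the exponential kernel."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """Kernelized attention: with exp[k^T q] replaced by g[k]^T g[q],
    the sums over positions factor out, so the cost is O(N D^2)
    rather than O(N^2 D).  Q, K, V are D x N."""
    Gq, Gk = g(Q), g(K)
    S = V @ Gk.T                 # D x D : sum_m v_m g[k_m]^T, computed once
    z = Gk.sum(axis=1)           # D     : sum_m g[k_m], for the denominator
    return (S @ Gq) / (z @ Gq)   # broadcasting divides each output column
```

The key point is that `S` and `z` are computed once and reused for every query, which is what "decoupling the queries and keys" buys.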
Training transformers: Training transformers is challenging and requires both learning rate
warm-up (Goyal et al., 2018) and Adam (Kingma & Ba, 2015). Indeed, Xiong et al. (2020a) and
Huang et al. (2020a) show that the gradients vanish, and the Adam updates decrease in magni-
tude without learning rate warm-up. Several interacting factors cause this problem. Residual
connections cause the exploding gradients (figure 11.6), but normalization layers prevent this.
Vaswani et al. (2017) used LayerNorm rather than BatchNorm because NLP statistics are highly
variable between batches, although subsequent work has modied BatchNorm for transformers
(Shen et al., 2020a). The positioning of the LayerNorm outside of the residual block causes
gradients to shrink as they pass back through the network (Xiong et al., 2020a). In addition,
the relative weight of the residual connections and main self-attention mechanism varies as we
move through the network upon initialization (see figure 11.6c). There is the additional com-
plication that the gradients for the query and key parameters are smaller than for the value
parameters (Liu et al., 2020), which necessitates the use of Adam. These factors interact in a
complex way, making training unstable and necessitating learning rate warm-up.
There have been various attempts to stabilize training, including (i) a variation of FixUp called
TFixup (Huang et al., 2020a) that allows the LayerNorm components to be removed, (ii) chang-
ing the position of the LayerNorm components in the network (Liu et al., 2020), and (iii)
re-weighting the two paths in the residual branches (Liu et al., 2020; Bachlechner et al., 2021).
Xu et al. (2021b) introduced an initialization scheme called DTFixup that allows transformers
to be trained with smaller datasets. A detailed discussion can be found in Prince (2021b).
Applications in vision: ImageGPT (Chen et al., 2020a) and the Vision Transformer (Doso-
vitskiy et al., 2021) were both early transformer architectures applied to images. Transformers
have been used for image classification (Dosovitskiy et al., 2021; Touvron et al., 2021), object
detection (Carion et al., 2020; Zhu et al., 2020b; Fang et al., 2021), semantic segmentation (Ye
et al., 2019; Xie et al., 2021; Gu et al., 2022), super-resolution (Yang et al., 2020a), action
recognition (Sun et al., 2019; Girdhar et al., 2019), image generation (Chen et al., 2021b; Nash
et al., 2021), visual question answering (Su et al., 2019b; Tan & Bansal, 2019), inpainting (Wan
et al., 2021; Zheng et al., 2021; Zhao et al., 2020b; Li et al., 2022), colorization (Kumar et al.,
2021), and many other vision tasks (Khan et al., 2022; Liu et al., 2023b).
Transformers and convolutional networks: Transformers have been combined with con-
volutional neural networks for many tasks, including image classification (Wu et al., 2020a),
object detection (Hu et al., 2018a; Carion et al., 2020), video processing (Wang et al., 2018c;
Sun et al., 2019), unsupervised object discovery (Locatello et al., 2020) and various text/vision
tasks (Chen et al., 2020d; Lu et al., 2019; Li et al., 2019). Transformers can outperform convolu-
tional networks for vision tasks but usually require large quantities of data to achieve superior
performance. Often, they are pre-trained on enormous datasets like JFT (Sun et al., 2017)
and LAION (Schuhmann et al., 2021). The transformer doesn’t have the inductive bias of
convolutional networks, but by using huge amounts of data, it can surmount this disadvantage.
From pixels to video: Non-local networks (Wang et al., 2018c) were an early application of
self-attention to image data. Transformers were initially applied to pixels in local neighborhoods
(Parmar et al., 2018; Hu et al., 2019; Parmar et al., 2019; Zhao et al., 2020a). ImageGPT (Chen
et al., 2020a) scaled this to model all pixels in a small image. The Vision Transformer (ViT)
(Dosovitskiy et al., 2021) used non-overlapping patches to analyze bigger images.
Since then, many multi-scale systems have been developed, including the SWin transformer
(Liu et al., 2021c), SWinV2 (Liu et al., 2022), multi-scale transformers (MViT) (Fan et al.,
2021), and pyramid vision transformers (Wang et al., 2021). The Crossformer (Wang et al.,
2022b) models interactions between spatial scales. Ali et al. (2021) introduced cross-covariance
image transformers, in which the channels rather than spatial positions attend to one another,
hence making the size of the attention matrix indifferent to the image size. The dual attention
vision transformer (DaViT) was developed by Ding et al. (2022) and alternates between local
spatial attention within sub-windows and spatially global attention between channels. Chu et al.
(2021) similarly alternate between local attention within sub-windows and global attention by
subsampling the spatial domain. Dong et al. (2022) adapt the ideas of gure 12.15, in which
the interactions between elements are sparsied to the 2D image domain.
Transformers were subsequently adapted to video processing (Arnab et al., 2021; Bertasius et al.,
2021; Liu et al., 2021c; Neimark et al., 2021; Patrick et al., 2021). A survey of transformers
applied to video can be found in Selva et al. (2022).
Combining images and text: CLIP (Radford et al., 2021) learns a joint encoder for images
and their captions using a contrastive pre-training task. The system ingests N images and
their captions and produces a matrix of compatibility between images and captions. The loss
function encourages the correct pairs to have a high score and the incorrect pairs to have a low
score. Ramesh et al. (2021) and Ramesh et al. (2022) train a diusion decoder to invert the
CLIP image encoder for text-conditional image generation (see chapter 18).
Problems
Problem 12.1 Consider a self-attention mechanism that processes N inputs of length D to
produce N outputs of the same size. How many weights and biases are used to compute the
queries, keys, and values? How many attention weights a[•, •] will there be? How many weights
and biases would there be in a fully connected shallow network relating all DN inputs to all DN
outputs?
Problem 12.2 Why might we want to ensure that the input to the self-attention mechanism is
the same size as the output?
Problem 12.3 Show that the self-attention mechanism (equation 12.8) is equivariant to a
permutation XP of the data X, where P is a permutation matrix (appendix B.4.4). In other
words, show that:

Sa[XP] = Sa[X]P.    (12.18)
Problem 12.4 Consider the softmax operation:

y_i = softmax_i[z] = exp[z_i] / Σ_{j=1}^{5} exp[z_j],    (12.19)

in the case where there are five inputs with values: z_1 = 3, z_2 = 1, z_3 = 100, z_4 = 5,
z_5 = 1. Compute the 25 derivatives ∂y_i/∂z_j for all i, j ∈ {1, 2, 3, 4, 5}. What do you conclude?
Problem 12.5 Why is implementation more efficient if the values, queries, and keys in each of
the H heads each have dimension D/H, where D is the original dimension of the data?
Problem 12.6 BERT was pre-trained using two tasks. The rst task requires the system to pre-
dict missing (masked) words. The second task requires the system to classify pairs of sentences
as being adjacent or not in the original text. Identify whether each of these tasks is generative
or contrastive (see section 9.3.6). Why do you think they used two tasks? Propose two novel
contrastive tasks that could be used to pre-train a language model.
Problem 12.7 Consider adding a new token to a precomputed masked self-attention mechanism
with N tokens. Describe the extra computation that must be done to incorporate this new
token.
Problem 12.8 Computation in vision transformers expands quadratically with the number of
patches. Devise two methods to reduce the computation using the principles from figure 12.15.
Problem 12.9 Consider representing an image with a grid of 16 × 16 patches, each represented
by a patch embedding of length 512. Compare the amount of computation required in the
DaViT transformer to perform attention (i) between the patches, using all of the channels, and
(ii) between the channels, using all of the patches.
Problem 12.10 Attention weights are usually computed as:

a[x_m, x_n] = softmax_m[k_mᵀ q_n] = exp[k_mᵀ q_n] / Σ_{m′=1}^{N} exp[k_{m′}ᵀ q_n].    (12.20)

Consider replacing exp[k_mᵀ q_n] with the dot product g[k_m]ᵀ g[q_n], where g[•] is a nonlinear
transformation. Show how this makes the computation of the attention weights more efficient.
Chapter 13
Graph neural networks
Chapter 10 described convolutional networks, which specialize in processing regular ar-
rays of data (e.g., images). Chapter 12 described transformers, which specialize in pro-
cessing sequences of variable length (e.g., text). This chapter describes graph neural
networks. As the name suggests, these are neural architectures that process graphs (i.e.,
sets of nodes connected by edges).
There are three novel challenges associated with processing graphs. First, their topol-
ogy is variable, and it is hard to design networks that are both sufficiently expressive and
can cope with this variation. Second, graphs may be enormous; a graph representing
connections between users of a social network might have a billion nodes. Third, there
may only be a single monolithic graph available, so the usual protocol of training with
many data examples and testing with new data is not always appropriate.
This chapter starts by presenting real-world examples of graphs. It then describes
how to encode these graphs and how to formulate supervised learning problems for
graphs. The algorithmic requirements for processing graphs are discussed, and these lead
naturally to graph convolutional networks, a particular type of graph neural network.
13.1 What is a graph?
A graph is a very general structure and consists of a set of nodes or vertices, where pairs
of nodes are connected by edges or links. Graphs are typically sparse; only a small subset
of the possible edges are present.
Some objects in the real world naturally take the form of graphs. For example,
road networks can be considered graphs where the nodes are physical locations, and the
edges represent roads between them (figure 13.1a). Chemical molecules are small graphs
where the nodes represent atoms, and the edges represent chemical bonds (figure 13.1b).
Electrical circuits are graphs where the nodes represent components and junctions, and
the edges are electrical connections (figure 13.1c).
Furthermore, many datasets can also be represented by graphs, even if this is not
their obvious surface form. For example:
Figure 13.1 Real-world graphs. Some objects, such as a) road networks, b)
molecules, and c) electrical circuits, are naturally structured as graphs.
• Social networks are graphs where nodes are people, and the edges represent friendships
between them.
• The scientific literature can be viewed as a graph where the nodes are papers, and the
edges represent citations.
• Wikipedia can be considered a graph where the nodes are articles, and the edges
represent hyperlinks between articles.
• Computer programs can be represented as graphs where the nodes are syntax tokens
(variables at different points in the program flow), and the edges represent computations
involving these variables.
• Geometric point clouds can be represented as graphs. Here, each point is a node with
edges connecting to other nearby points.
• Protein interactions in a cell can be expressed as graphs, where the nodes are the
proteins, and there is an edge between two proteins if they interact.
In addition, a set (an unordered list) can be treated as a graph in which every member
is a node and connects to every other. An image can be treated as a graph with regular
topology, in which each pixel is a node with edges to the adjacent pixels.
13.1.1 Types of graphs
Graphs can be categorized in various ways. The social network in figure 13.2a contains
undirected edges; each pair of individuals with a connection between them has mutually
agreed to be friends, so there is no sense that the relationship is directional. In contrast,
the citation network in figure 13.2b contains directed edges. Each paper cites other
papers, and this relationship is inherently one-way.
Figure 13.2c depicts a knowledge graph that encodes a set of facts about objects by
defining relations between them. Technically, this is a directed heterogeneous multigraph.
It is heterogeneous because the nodes can represent different types of entities (e.g., people,
countries, companies). It is a multigraph because there can be multiple edges of different
types between any two nodes.
Figure 13.2 Types of graphs. a) A social network is an undirected graph; the
connections between people are symmetric. b) A citation network is a directed
graph; one publication cites another, so the relationship is asymmetric. c) A
knowledge graph is a directed heterogeneous multigraph. The nodes are hetero-
geneous in that they represent different object types (people, places, companies)
and multiple edges may represent different relations between each node. d) A
point set can be converted to a graph by forming edges between nearby points.
Each node has an associated position in 3D space, and this is termed a geometric
graph (adapted from Hu et al., 2022). e) The scene on the left can be represented
by a hierarchical graph. The topology of the room, table, and light are all repre-
sented by graphs. These graphs form nodes in a larger graph representing object
adjacency (adapted from Fernández-Madrigal & González, 2002).
Figure 13.3 Graph representation. a) Example graph with six nodes and seven
edges. Each node has an associated embedding of length five (brown vectors).
Each edge has an associated embedding of length four (blue vectors). This graph
can be represented by three matrices. b) The adjacency matrix is a binary matrix
where element (m, n) is set to one if node m connects to node n. c) The node
data matrix X contains the concatenated node embeddings. d) The edge data
matrix E contains the edge embeddings.
The point set representing the airplane in figure 13.2d can be converted into a graph
by connecting each point to its K nearest neighbors. The result is a geometric graph
where each point is associated with a position in 3D space. Figure 13.2e represents a
hierarchical graph. The table, light, and room are each described by graphs representing
the adjacency of their respective components. These three graphs are themselves nodes
in another graph that represents the topology of the objects in a larger model.
All types of graphs can be processed using deep learning. However, this chapter
focuses on undirected graphs like the social network in figure 13.2a.
13.2 Graph representation
In addition to the graph structure itself, information is typically associated with each
node. For example, in a social network, each individual might be characterized by a fixed-
length vector representing their interests. Sometimes, the edges also have information
attached. For example, in the road network example, each edge might be characterized
by its length, number of lanes, frequency of accidents, and speed limit. The information
at a node is stored in a node embedding, and the information at an edge is stored in an
edge embedding.
More formally, a graph consists of a set of N nodes connected by a set of E edges. The
graph can be encoded by three matrices A, X, and E, representing the graph structure,
node embeddings, and edge embeddings, respectively (figure 13.3).
Figure 13.4 Properties of the adjacency matrix. a) Example graph. b) Posi-
tion (m, n) of the adjacency matrix A contains the number of walks of length one
from node m to node n. c) Position (m, n) of the squared adjacency matrix A²
contains the number of walks of length two from node n to node m. d) One-hot
vector representing node six, which was highlighted in panel (a). e) When we
pre-multiply this vector by A, the result contains the number of walks of length
one from node six to each node; we can reach nodes five, seven, and eight in one
move. f) When we pre-multiply this vector by A², the resulting vector contains
the number of walks of length two from node six to each node; we can reach nodes
two, three, four, five, and eight in two moves, and we can return to the original
node in three different ways (via nodes five, seven, and eight).
The graph structure is represented by the adjacency matrix, A. This is an N × N
matrix where entry (m, n) is set to one if there is an edge between nodes m and n and
zero otherwise (problems 13.1–13.2). For undirected graphs, this matrix is always symmetric. For large sparse
graphs, it can be stored as a list of connections (m, n) to save memory.
The n-th node has an associated node embedding x^(n) of length D. These embeddings
are concatenated and stored in the D × N node data matrix X. Similarly, the e-th edge has
an associated edge embedding e^(e) of length D_E. These edge embeddings are collected
into the D_E × E matrix E. For simplicity, we initially consider graphs that only have
node embeddings and return to edge embeddings in section 13.9.
13.2.1 Properties of the adjacency matrix
The adjacency matrix can be used to find the neighbors of a node using linear algebra.
Consider encoding the n-th node as a one-hot column vector (a vector with only one
non-zero entry, at position n, which is set to one). When we pre-multiply this vector by
the adjacency matrix, it extracts the n-th column of the adjacency matrix and returns a
vector with ones at the positions of the neighbors (i.e., all the places we can reach in a
walk of length one from the n-th node). If we repeat this procedure (i.e., pre-multiply
by A again), the resulting vector contains the number of walks of length two from node n
to every node (figures 13.4d–f; problems 13.3–13.4).
In general, if we raise the adjacency matrix to the power of L, the entry at posi-
tion (m, n) of A^L contains the number of unique walks of length L from node n to
node m (figures 13.4a–c; notebook 13.1). This is not the same as the number of unique paths since it
includes routes that visit the same node more than once. Nonetheless, A^L still contains
valuable information about the graph connectivity; a non-zero entry at position (m, n)
indicates that the distance from m to n must be less than or equal to L.
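This walk-counting behavior can be checked numerically on a toy graph (the node labels and edges below are chosen arbitrarily for illustration):

```python
import numpy as np

# A small undirected graph with edges 0-1, 1-2, 2-0, 2-3.
A = np.zeros((4, 4), dtype=int)
for m, n in [(0, 1), (1, 2), (2, 0), (2, 3)]:
    A[m, n] = A[n, m] = 1           # symmetric for an undirected graph

x = np.zeros(4, dtype=int)
x[0] = 1                             # one-hot encoding of node 0

walks1 = A @ x                       # length-1 walks from node 0: [0, 1, 1, 0]
walks2 = np.linalg.matrix_power(A, 2) @ x   # length-2 walks: [2, 1, 1, 1]
```

Here `walks2[0] = 2` because there are two ways to return to node 0 in two moves (via nodes 1 and 2), matching the description above.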
13.2.2 Permutation of node indices
Node indexing in graphs is arbitrary; permuting the node indices results in a permu-
tation of the columns of the node data matrix X and a permutation of both the rows
and columns of the adjacency matrix A. However, the underlying graph is unchanged
(figure 13.5). This is in contrast to images, where permuting the pixels creates a different
image, and to text, where permuting the words creates a different sentence.
The operation of exchanging node indices can be expressed mathematically by a
permutation matrix, P. This is a matrix where exactly one entry in each row and
column takes the value one, and the remaining values are zero. When position (m, n) of
the permutation matrix is set to one, it indicates that node m will become node n after
the permutation (problem 13.5). To map from one indexing to another, we use the operations:

X′ = XP
A′ = PᵀAP,    (13.1)
where post-multiplying by P permutes the columns and pre-multiplying by Pᵀ permutes
the rows. It follows that any processing applied to the graph should also be indifferent
to these permutations. Otherwise, the result will depend on the choice of node indices.
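These operations can be demonstrated on random data (the permutation and graph below are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 5, 3
X = rng.standard_normal((D, N))            # node data matrix (D x N)
A = np.triu(rng.integers(0, 2, (N, N)), 1)
A = A + A.T                                # random symmetric adjacency matrix

perm = np.array([2, 0, 3, 4, 1])           # node m becomes node perm[m]
P = np.zeros((N, N), dtype=int)
P[np.arange(N), perm] = 1                  # permutation matrix

X_new = X @ P                              # permuted node data (equation 13.1)
A_new = P.T @ A @ P                        # permuted adjacency matrix
```

One can verify that column m of X reappears as column perm[m] of X_new, and that each node keeps its degree under the re-indexing.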
13.3 Graph neural networks, tasks, and loss functions
A graph neural network is a model that takes the node embeddings X and the adjacency
matrix A as inputs and passes them through a series of K layers. The node embeddings
are updated at each layer to create intermediate “hidden” representations H_k before
finally computing output embeddings H_K.
At the start of this network, each column of the input node embeddings X just con-
tains information about the node itself. At the end, each column of the model output H_K
includes information about the node and its context within the graph. This is similar to
includes information about the node and its context within the graph. This is similar to
word embeddings passing through a transformer network. These represent words at the
start, but represent the word meanings in the context of the sentence at the end.
Figure 13.5 Permutation of node indices. a) Example graph, b) associated adja-
cency matrix and c) node embeddings. d) The same graph where the (arbitrary)
order of the indices has been changed. e) The adjacency matrix and f) node
matrix are now different. Consequently, any network layer that operates on the
graph should be indifferent to the ordering of the nodes.
13.3.1 Tasks and loss functions
We defer discussion of graph neural network models until section 13.4 and first describe
the types of problems these networks tackle and their associated loss functions. Super-
vised graph problems usually fall into one of three categories (figure 13.6).
Graph-level tasks: The network assigns a label or estimates one or more values from
the entire graph, exploiting both the structure and node embeddings. For example, we
might want to predict the temperature at which a molecule becomes liquid (a regression
task) or whether a molecule is poisonous to human beings or not (a classification task).
For graph-level tasks, the output node embeddings are combined (e.g., by averaging),
and the resulting vector is mapped via a linear transformation or neural network to a
fixed-size vector. For regression, the mismatch between the result and the ground truth
values is computed using the least squares loss. For binary classification, the output
is passed through a sigmoid function, and the mismatch is calculated using the binary
cross-entropy loss. Here, the probability that the graph belongs to class one might be
given by:
Pr(y = 1 | X, A) = sig[β_K + ω_K H_K 1/N],    (13.2)

where the scalar β_K and the 1 × D vector ω_K are learned parameters. Post-multiplying the
output embedding matrix H_K by the column vector 1 (which contains N ones) sums
together all the embeddings; dividing by the number of nodes N then computes their
average. This is known as mean pooling (see figure 10.11).
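A minimal numpy sketch of this mean-pooling readout (equation 13.2); the function name and the example values used below are our own:

```python
import numpy as np

def sig(z):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-z))

def graph_class_probability(H_K, beta_K, omega_K):
    """Graph-level binary classification by mean pooling, as in
    equation 13.2.  H_K is the D x N matrix of output node embeddings,
    beta_K a learned scalar, and omega_K a learned 1 x D row vector."""
    D, N = H_K.shape
    mean_embedding = H_K @ np.ones((N, 1)) / N    # D x 1 average embedding
    return float(sig(beta_K + omega_K @ mean_embedding))
```

For instance, with D = 2, N = 3 and the mean embedding [1, 4], weights ω_K = [1, −1] and β_K = 0 give probability sig(−3).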
Figure 13.6 Common tasks for graphs. In each case, the input is a graph repre-
sented by its adjacency matrix and node embeddings. The graph neural network
processes the node embeddings by passing them through a series of layers. The
node embeddings at the last layer contain information about both the node and
its context in the graph. a) Graph classification. The node embeddings are com-
bined (e.g., by averaging) and then mapped to a fixed-size vector that is passed
through a softmax function to produce class probabilities. b) Node classification.
Each node embedding is used individually as the basis for classification (cyan
and orange colors represent assigned node classes). c) Edge prediction. Node
embeddings adjacent to the edge are combined (e.g., by taking the dot product)
to compute a single number that is mapped via a sigmoid function to produce a
probability that a missing edge should be present.
Draft: please send errata to udlbookmail@gmail.com.
Node-level tasks: The network assigns a label (classification) or one or more values
(regression) to each node of the graph, using both the graph structure and node em-
beddings. For example, given a graph constructed from a 3D point cloud similar to
figure 13.2d, the goal might be to classify the nodes according to whether they belong
to the wings or fuselage. Loss functions are defined in the same way as for graph-level
tasks, except that now this is done independently at each node n:

Pr(y^(n) = 1|X, A) = sig[β_K + ω_K h_K^(n)].    (13.3)
Edge prediction tasks: The network predicts whether or not there should be an edge
between nodes n and m. For example, in the social network setting, the network might
predict whether two people know and like each other and suggest that they connect if
that is the case. This is a binary classification task where the two node embeddings must
be mapped to a single number representing the probability that the edge is present. One
possibility is to take the dot product of the node embeddings and pass the result through
a sigmoid function to create the probability:

Pr(y^(mn) = 1|X, A) = sig[h^(m)T h^(n)].    (13.4)
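Equation 13.4 can be sketched directly in numpy. This is our own toy illustration (names are hypothetical):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def edge_probability(H, m, n):
    """Pr(edge between nodes m and n) from the node embeddings.

    H: D x N matrix whose columns are node embeddings.
    The dot product of the two embeddings is squashed to (0, 1).
    """
    return sigmoid(H[:, m] @ H[:, n])

# Similar embeddings give a high edge probability; dissimilar ones a low one
H = np.array([[1.0, 0.9, -1.0],
              [0.5, 0.6, -0.4]])
p_similar = edge_probability(H, 0, 1)     # large positive dot product
p_dissimilar = edge_probability(H, 0, 2)  # negative dot product
```

Note that the dot product is symmetric in m and n, so this parameterization cannot represent directed edges; other combination rules would be needed for that.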
13.4 Graph convolutional networks
There are many types of graph neural networks, but here we focus on spatial-based
convolutional graph neural networks, or GCNs for short. These models are convolutional
in that they update each node by aggregating information from nearby nodes. As such,
they induce a relational inductive bias (i.e., a bias toward prioritizing information from
neighbors). They are spatial-based because they use the original graph structure. This
contrasts with spectral-based methods, which apply convolutions in the Fourier domain.
Each layer of the GCN is a function F[•] with parameters that takes the node
embeddings and adjacency matrix and outputs new node embeddings. The network can
hence be written as:

H_1 = F[X, A, ϕ_0]
H_2 = F[H_1, A, ϕ_1]
H_3 = F[H_2, A, ϕ_2]
⋮
H_K = F[H_{K-1}, A, ϕ_{K-1}],    (13.5)

where X is the input, A is the adjacency matrix, H_k contains the modified node em-
beddings at the k-th layer, and ϕ_k denotes the parameters that map from layer k to
layer k+1.
13.4.1 Equivariance and invariance
We noted before that the indexing of the nodes in the graph is arbitrary, and any
permutation of the node indices does not change the graph. It is hence imperative that
any model respects this property. It follows that each layer must be equivariant (see
section 10.1) with respect to permutations of the node indices. In other words, if we
permute the node indices, the node embeddings at each stage will be permuted in the
same way. In mathematical terms, if P is a permutation matrix, then we must have:

H_{k+1} P = F[H_k P, P^T A P, ϕ_k].    (13.6)
For node classification and edge prediction tasks, the output should also be equiv-
ariant with respect to permutations of the node indices. However, for graph-level tasks,
the final layer aggregates information from across the graph, so the output is invariant
to the node order. In fact, the output layer from equation 13.2 achieves this because:

y = sig[β_K + ω_K H_K 1/N] = sig[β_K + ω_K H_K P 1/N],    (13.7)

for any permutation matrix P (see problem 13.6).
This mirrors the case for images, where segmentation should be equivariant to geo-
metric transformations, and image classification should be invariant (figure 10.1). Here,
convolutional and pooling layers partially achieve this with respect to translations, but
there is no known way to guarantee these properties exactly for more general transfor-
mations. However, for graphs, it is possible to define networks that ensure equivariance
or invariance to permutations.
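The invariance in equation 13.7 is easy to confirm numerically. A small numpy check (assuming nothing beyond the equation itself):

```python
import numpy as np

rng = np.random.default_rng(0)

D, N = 4, 5
H = rng.normal(size=(D, N))          # node embeddings (columns)
omega = rng.normal(size=(1, D))      # readout weights
beta = 0.3
ones = np.ones((N, 1))

# Mean-pooled graph-level output (pre-sigmoid), as in equation 13.2
y = beta + omega @ H @ ones / N

# Apply a random permutation P to the node order
perm = rng.permutation(N)
P = np.eye(N)[:, perm]               # permutation matrix
y_perm = beta + omega @ (H @ P) @ ones / N

# Invariance: P 1 = 1, so the pooled output is unchanged
assert np.allclose(y, y_perm)
```

The key step is that each row of P contains exactly one 1, so P 1 = 1 and the average over nodes does not depend on their order.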
13.4.2 Parameter sharing
Chapter 10 argued that applying fully connected networks to images isn't sensible because
this requires the network to learn how to recognize an object separately at every image
position. Instead, we used convolutional layers that processed every position in the image
identically. This reduced the number of parameters and introduced an inductive bias
that forced the model to treat every part of the image in the same way.

The same argument can be made about nodes in a graph. We could learn a model
with separate parameters associated with each node. However, now the network must
independently learn the meaning of the connections in the graph at each position, and
training would require many graphs with the same topology. Instead, we build a model
that uses the same parameters at every node, reducing the number of parameters and
sharing what the network learns at each node across the entire graph.
Recall that a convolution (equation 10.3) updates a variable by taking a weighted
sum of information from its neighbors. One way to think of this is that each neighbor
sends a message to the variable of interest, which aggregates these messages to form the
update. When we considered images, the neighbors were pixels from a fixed-size square
region around the current position, so the spatial relationships at each position are the
same. However, in a graph, each node may have a different number of neighbors, and
there are no consistent relationships; there is no sense that we can weight information
from a node that is "above" the node of interest differently to information from a node
that is "below" it.

Figure 13.7 Simple graph CNN layer. a) Input graph consists of structure (em-
bodied in graph adjacency matrix A, not shown) and node embeddings (stored
in columns of X). b) Each node in the first hidden layer is updated by (i) ag-
gregating the neighboring nodes to form a single vector, (ii) applying a linear
transformation Ω_0 to the aggregated nodes, (iii) applying the same linear trans-
formation Ω_0 to the original node, (iv) adding these together with a bias β_0,
and finally (v) applying a nonlinear activation function a[•] like a ReLU. c) This
process is repeated at subsequent layers (but with different parameters for each
layer) until we produce the final embeddings at the end of the network.
13.4.3 Example GCN layer
These considerations lead to a simple GCN layer (figure 13.7). At each node n in layer k,
we aggregate information from neighboring nodes by summing their node embeddings h:

agg[n, k] = Σ_{m∈ne[n]} h_k^(m),    (13.8)

where ne[n] returns the set of indices of the neighbors of node n. Then we apply a
linear transformation Ω_k to the embedding h_k^(n) at the current node and to this ag-
gregated value, add a bias term β_k, and pass the result through a nonlinear activation
function a[•], which is applied independently to every member of its vector argument:

h_{k+1}^(n) = a[β_k + Ω_k h_k^(n) + Ω_k agg[n, k]].    (13.9)
We can write this more succinctly by noting that post-multiplication of a matrix
by a vector returns a weighted sum of its columns. The n-th column of the adjacency
matrix A contains ones at the positions of the neighbors. Hence, if we collect the node
embeddings into the D×N matrix H_k and post-multiply by the adjacency matrix A,
the n-th column of the result is agg[n, k]. The update for the nodes is now:

H_{k+1} = a[β_k 1^T + Ω_k H_k + Ω_k H_k A]
        = a[β_k 1^T + Ω_k H_k (A + I)],    (13.10)

where 1 is an N×1 vector containing ones. Here, the nonlinear activation function a[•]
is applied independently to every member of its matrix argument.
This layer satisfies the design considerations: it is equivariant to permutations of the
node indices, can cope with any number of neighbors, exploits the graph structure to
provide a relational inductive bias, and shares parameters throughout the graph (see
problem 13.7).
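Equation 13.10 is a one-liner in numpy. The sketch below is our own minimal version (names hypothetical; a ReLU is chosen as the activation):

```python
import numpy as np

def gcn_layer(H, A, Omega, beta):
    """One simple GCN layer: H_{k+1} = a[beta 1^T + Omega H (A + I)].

    H:     D x N node embeddings (columns are nodes)
    A:     N x N adjacency matrix
    Omega: D' x D learned weight matrix
    beta:  D' x 1 learned bias
    """
    N = A.shape[0]
    pre_activation = beta @ np.ones((1, N)) + Omega @ H @ (A + np.eye(N))
    return np.maximum(pre_activation, 0.0)  # ReLU, applied elementwise

# Toy graph: three nodes in a chain 0-1-2
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
H0 = np.eye(3)                       # trivial one-hot input embeddings
Omega = np.ones((2, 3))
beta = np.zeros((2, 1))
H1 = gcn_layer(H0, A, Omega, beta)   # 2 x 3 output embeddings
```

With one-hot inputs and all-ones weights, each output column simply counts the node itself plus its neighbors (2, 3, and 2 for the chain), which makes the aggregation easy to sanity-check.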
13.5 Example: graph classification
We now combine these ideas to describe a network that classifies molecules as toxic or
harmless (see notebook 13.2). The network inputs are the adjacency matrix and node
embedding matrix X. The adjacency matrix A ∈ ℝ^{N×N} derives from the molecular
structure. The columns of the node embedding matrix X ∈ ℝ^{118×N} are one-hot
vectors indicating which of the 118 elements of the periodic table are present. In other
words, they are vectors of length 118 where every position is zero except for the position
corresponding to the relevant element, which is set to one. The node embeddings can
be transformed to an arbitrary size D by the first weight matrix Ω_0 ∈ ℝ^{D×118}.
The network equations are:

H_1 = a[β_0 1^T + Ω_0 X(A + I)]
H_2 = a[β_1 1^T + Ω_1 H_1 (A + I)]
⋮
H_K = a[β_{K-1} 1^T + Ω_{K-1} H_{K-1} (A + I)]
f[X, A, Φ] = sig[β_K + ω_K H_K 1/N],    (13.11)

where the network output f[X, A, Φ] is a single value that determines the probability
that the molecule is toxic (see equation 13.2).
13.5.1 Training with batches
Given I training graphs {X_i, A_i} and their labels y_i, the parameters Φ = {β_k, Ω_k}_{k=0}^{K}
can be learned using SGD and the binary cross-entropy loss (equation 5.19). Fully
connected networks, convolutional networks, and transformers all exploit the parallelism
of modern hardware to process an entire batch of training examples concurrently. To this
end, the batch elements are concatenated into a higher-dimensional tensor (section 7.4.2).
Figure 13.8 Inductive vs. transductive problems. a) Node classification task in the
inductive setting. We are given a set of I training graphs, where the node labels
(orange and cyan colors) are known. After training, we are given a test graph
and must assign labels to each node. b) Node classification in the transductive
setting. There is one large graph in which some nodes have labels (orange and
cyan colors), and others are unknown. We train the model to predict the known
labels correctly and then examine the predictions at the unknown nodes.
However, each graph may have a different number of nodes. Hence, the matrices X_i and
A_i have different sizes, and there is no way to concatenate them into 3D tensors.
Luckily, a simple trick allows us to process the whole batch in parallel. The graphs
in the batch are treated as disjoint components of a single large graph. The network can
then be run as a single instance of the network equations. The mean pooling is carried
out only over the individual graphs to make a single representation per graph that can
be fed into the loss function.
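This disjoint-component trick amounts to building a block-diagonal adjacency matrix. A minimal numpy sketch (our own helper names, not from the book):

```python
import numpy as np

def batch_graphs(adjacency_list, embedding_list):
    """Merge a batch of graphs into one large graph with disjoint components.

    adjacency_list: list of N_i x N_i adjacency matrices
    embedding_list: list of D x N_i node embedding matrices
    Returns the block-diagonal adjacency matrix, the concatenated
    embeddings, and per-graph node index ranges for separate mean pooling.
    """
    sizes = [A.shape[0] for A in adjacency_list]
    total = sum(sizes)
    A_batch = np.zeros((total, total))
    start = 0
    ranges = []
    for A, n in zip(adjacency_list, sizes):
        A_batch[start:start + n, start:start + n] = A  # disjoint block
        ranges.append((start, start + n))
        start += n
    H_batch = np.concatenate(embedding_list, axis=1)   # D x total
    return A_batch, H_batch, ranges

def pool_per_graph(H, ranges):
    """Mean-pool the embeddings of each component graph separately."""
    return [H[:, s:e].mean(axis=1) for (s, e) in ranges]

# Two toy graphs: a 2-node edge and a 3-node chain, with 1-D embeddings
A1 = np.array([[0., 1.], [1., 0.]])
A2 = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
H1 = np.array([[1., 2.]])
H2 = np.array([[3., 4., 5.]])
A_batch, H_batch, ranges = batch_graphs([A1, A2], [H1, H2])
means = pool_per_graph(H_batch, ranges)
```

Because the off-diagonal blocks of A_batch are zero, no messages pass between graphs, so one forward pass through the merged graph gives the same result as processing each graph separately.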
13.6 Inductive vs. transductive models
Until this point, all of the models in this book have been inductive: we exploit a training
set of labeled data to learn the relation between the inputs and outputs. Then we apply
this to new test data. One way to think of this is that we are learning the rule that maps
inputs to outputs and then applying it elsewhere.
By contrast, a transductive model considers both the labeled and unlabeled data
at the same time. It does not produce a rule but merely a labeling for the unknown
outputs. This is sometimes termed semi-supervised learning. It has the advantage that
it can use patterns in the unlabeled data to help make its decisions. However, it has the
disadvantage that the model needs to be retrained when extra unlabeled data are added.
Both problem types are commonly encountered for graphs (figure 13.8). Sometimes,
we have many labeled graphs and learn a mapping between the graph and the labels.
For example, we might have many molecules, each labeled according to whether it is
toxic to humans. We learn the rule that maps the graph to the toxic/non-toxic label and
then apply this rule to new molecules. However, sometimes there is a single monolithic
graph. In the graph of scientific paper citations, we might have labels indicating the field
(physics, biology, etc.) for some nodes and wish to label the remaining nodes. Here, the
training and test data are irrevocably connected.
Graph-level tasks only occur in the inductive setting where there are training and test
graphs. However, node-level tasks and edge prediction tasks can occur in either setting.
In the transductive case, the loss function minimizes the mismatch between the model
output and the ground truth where this is known. New predictions are computed by
running the forward pass and retrieving the results where the ground truth is unknown.
13.7 Example: node classification
As a second example, consider a binary node classification task in a transductive setting.
We start with a commercial-sized graph with millions of nodes. Some nodes have ground
truth binary labels, and the goal is to label the remaining unlabeled nodes. The body
of the network will be the same as in the previous example (equation 13.11) but with a
different final layer that produces an output vector of size 1×N:

f[X, A, Φ] = sig[β_K 1^T + ω_K H_K],    (13.12)

where the function sig[•] applies the sigmoid function independently to every element
of the row vector input. As usual, we use the binary cross-entropy loss, but now only
at nodes where we know the ground truth label y. Note that equation 13.12 is just a
vectorized version of the node classification loss from equation 13.3.
Training this network raises two problems. First, it is logistically difficult to train a
graph neural network of this size. Consider that we must store the node embeddings at
every network layer in the forward pass. This will involve both storing and processing
a structure several times the size of the entire graph, and this may not be practical.
Second, we have only a single graph, so it's not obvious how to perform stochastic
gradient descent. How can we form a batch if there is only a single object?
13.7.1 Choosing batches
One way to form a batch is to choose a random subset of labeled nodes at each training
step. Each node depends on its neighbors in the previous layer. These, in turn, depend
on their neighbors in the layer before, so each node has the equivalent of a receptive field
(figure 13.9). The size of the receptive field is termed the k-hop neighborhood. We can
hence perform a gradient descent step using the graph that forms the union of the k-hop
neighborhoods of the batch nodes; the remaining inputs do not contribute.

Figure 13.9 Receptive fields in graph neural networks. Consider the orange node
in hidden layer two (right). This receives input from the nodes in the 1-hop
neighborhood in hidden layer one (shaded region in center). These nodes in
hidden layer one receive inputs from their neighbors in turn, and the orange node
in layer two receives inputs from all the input nodes in the 2-hop neighborhood
(shaded area on left). The region of the graph that contributes to a given node
is equivalent to the notion of a receptive field in convolutional neural networks.
Unfortunately, if there are many layers and the graph is densely connected, every
input node may be in the receptive field of every output, and this may not reduce the
graph size at all. This is known as the graph expansion problem. Two approaches that
tackle this problem are neighborhood sampling and graph partitioning.
Neighborhood sampling: The full graph that feeds into the batch of nodes is sampled,
thereby reducing the connections at each network layer (figure 13.10; see notebook 13.3).
For example, we might start with the batch nodes and randomly sample a fixed number
of their neighbors in the previous layer. Then, we randomly sample a fixed number of
their neighbors in the layer before, and so on. The graph still increases in size with each
layer but in a much more controlled way. This is done anew for each batch, so the
contributing neighbors differ even if the same batch is drawn twice. This is also
reminiscent of dropout (section 9.3.3) and adds some regularization.
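A minimal sketch of this sampling scheme follows. This is our own simplified version (function and variable names are hypothetical; practical samplers, e.g., GraphSAGE's, differ in detail):

```python
import random

def sample_receptive_field(neighbors, batch_nodes, num_layers, fanout):
    """Work back from a batch of output nodes, sampling at most `fanout`
    neighbors per node per layer instead of the full k-hop neighborhood.

    neighbors:   dict mapping node -> list of neighbor nodes
    batch_nodes: nodes whose labels appear in the loss
    Returns the set of nodes needed at each layer (last entry = batch).
    """
    layers = [set(batch_nodes)]
    frontier = set(batch_nodes)
    for _ in range(num_layers):
        sampled = set(frontier)            # a node also feeds its own update
        for node in frontier:
            nbrs = neighbors[node]
            k = min(fanout, len(nbrs))
            sampled.update(random.sample(nbrs, k))
        layers.insert(0, sampled)
        frontier = sampled
    return layers

# Star graph: node 0 connected to nodes 1..9
neighbors = {0: list(range(1, 10)), **{i: [0] for i in range(1, 10)}}
layers = sample_receptive_field(neighbors, batch_nodes=[0],
                                num_layers=2, fanout=3)
```

For the star graph, the full 2-hop neighborhood of node 0 is all ten nodes, but with a fanout of three, each layer retains only a handful of them.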
Graph partitioning: A second approach is to cluster the original graph into disjoint
subsets of nodes (i.e., smaller graphs that are not connected to one another) before
processing (figure 13.11). There are standard algorithms to choose these subsets to
maximize the number of internal links. These smaller graphs can each be treated as
batches, or a random subset of them can be combined to form a batch (reinstating any
edges between them from the original graph).
Given one of the above methods to form batches, we can now train the network
parameters in the same way as for the inductive setting, dividing the labeled nodes into
train, test, and validation sets as desired; we have effectively converted a transductive
problem to an inductive one. To perform inference, we compute predictions for the
unknown nodes based on their k-hop neighborhood. Unlike training, this does not require
storing the intermediate representations, so it is much more memory efficient.

Figure 13.10 Neighborhood sampling. a) One way of forming batches on large
graphs is to choose a subset of labeled nodes in the output layer (here, just one
node in layer two, right) and then work back to find all of the nodes in the K-
hop neighborhood (receptive field). Only this sub-graph is needed to train this
batch. Unfortunately, if the graph is densely connected, this may retain a large
proportion of the graph. b) One solution is neighborhood sampling. As we work
back from the final layer, we select a subset of neighbors (here, three) in the
layer before and a subset of the neighbors of these in the layer before that. This
restricts the size of the graph for training the batch. In all panels, the brightness
represents the distance from the original node.

Figure 13.11 Graph partitioning. a) Input graph. b) The input graph is parti-
tioned into smaller subgraphs using a principled method that removes the fewest
edges. c-d) We can now use these subgraphs as batches to train in a transductive
setting, so here, there are four possible batches. e) Alternatively, we can use
combinations of the subgraphs as batches, reinstating the edges between them.
If we use pairs of subgraphs, there would be six possible batches here.
13.8 Layers for graph convolutional networks
In the previous examples, we combined messages from adjacent nodes by summing them
together with the transformed current node. This was accomplished by post-multiplying
the node embedding matrix H by the adjacency matrix plus the identity, A + I. We now
consider different approaches to both (i) the combination of the current embedding with
the aggregated neighbors and (ii) the aggregation process itself.
13.8.1 Combining current node and aggregated neighbors
In the example GCN layer above, we combined the aggregated neighbors HA with the
current nodes H by just summing them:

H_{k+1} = a[β_k 1^T + Ω_k H_k (A + I)].    (13.13)
In another variation, the current node is multiplied by a factor of (1 + ε_k) before con-
tributing to the sum, where ε_k is a learned scalar that is different for each layer:

H_{k+1} = a[β_k 1^T + Ω_k H_k (A + (1 + ε_k)I)].    (13.14)
This is known as diagonal enhancement. A related variation applies a different linear
transform Ψ_k to the current node:

H_{k+1} = a[β_k 1^T + Ω_k H_k A + Ψ_k H_k]
        = a[β_k 1^T + [Ω_k, Ψ_k] [H_k A; H_k]]
        = a[β_k 1^T + Ω'_k [H_k A; H_k]],    (13.15)

where [Ω_k, Ψ_k] concatenates the two weight matrices horizontally, [H_k A; H_k] stacks
the two embedding matrices vertically, and we have defined Ω'_k = [Ω_k, Ψ_k] in the
third line.
13.8.2 Residual connections
With residual connections, the aggregated representation from the neighbors is trans-
formed and passed through the activation function before summation or concatenation
with the current node. For the latter case, the associated network equations are:

H_{k+1} = [a[β_k 1^T + Ω_k H_k A]; H_k],    (13.16)

where the semicolon again denotes vertical concatenation.
13.8.3 Mean aggregation
The above methods aggregate the neighbors by summing the node embeddings. However,
it's possible to combine the embeddings in different ways. Sometimes it's better to take
the average of the neighbors rather than the sum; this can be superior if the embedding
information is more important and the structural information less so since the magnitude
of the neighborhood contributions will not depend on the number of neighbors:

agg[n] = (1/|ne[n]|) Σ_{m∈ne[n]} h_m,    (13.17)
where as before, ne[n] denotes a set containing the indices of the neighbors of the n-th
node. Equation 13.17 can be computed neatly in matrix form by introducing the diago-
nal N×N degree matrix D (see problem 13.8). Each non-zero element of this matrix
contains the number of neighbors for the associated node. It follows that each diagonal
element in the inverse matrix D^{-1} contains the denominator that we need to compute
the average. The new GCN layer can be written as:

H_{k+1} = a[β_k 1^T + Ω_k H_k (A D^{-1} + I)].    (13.18)
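A numpy sketch of this mean-aggregation layer follows (our own illustration; it assumes no isolated nodes, since D would then not be invertible):

```python
import numpy as np

def mean_aggregation_layer(H, A, Omega, beta):
    """GCN layer with mean (rather than sum) neighbor aggregation:
    H_{k+1} = a[beta 1^T + Omega H (A D^{-1} + I)].

    D is the diagonal degree matrix; post-multiplying A by D^{-1}
    divides each node's incoming messages by its number of neighbors.
    """
    N = A.shape[0]
    degrees = A.sum(axis=0)                    # neighbors per node
    D_inv = np.diag(1.0 / degrees)             # assumes no isolated nodes
    pre = beta @ np.ones((1, N)) + Omega @ H @ (A @ D_inv + np.eye(N))
    return np.maximum(pre, 0.0)                # ReLU activation

# Chain graph 0-1-2: node 1 has two neighbors, nodes 0 and 2 have one
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
H = np.array([[2., 4., 6.]])                   # 1-D embeddings
out = mean_aggregation_layer(H, A, Omega=np.eye(1), beta=np.zeros((1, 1)))
```

With identity weights, each output is the node's own embedding plus the mean of its neighbors: 2+4, 4+(2+6)/2, and 6+4, i.e., 6, 8, and 10.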
13.8.4 Kipf normalization
There are many variations of graph neural networks based on mean aggregation. Some-
times the current node is included with its neighbors in the mean computation rather
than treated separately. In Kipf normalization, the sum of the node representations is
normalized as (see problem 13.9):

agg[n] = Σ_{m∈ne[n]} h_m / √(|ne[n]| |ne[m]|),    (13.19)

with the logic that information coming from nodes with a very large number of neighbors
should be down-weighted since there are many connections and they provide less unique
information. This can also be expressed in matrix form using the degree matrix:

H_{k+1} = a[β_k 1^T + Ω_k H_k (D^{-1/2} A D^{-1/2} + I)].    (13.20)
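The equivalence of the per-node form (13.19) and the matrix form (13.20) can be checked numerically. A small numpy sketch (our own, assuming no isolated nodes):

```python
import numpy as np

# Chain graph 0-1-2
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
H = np.array([[2., 4., 6.]])            # 1-D node embeddings
deg = A.sum(axis=0)                     # node degrees

# Per-node form of equation 13.19
agg = np.zeros_like(H)
for n in range(3):
    for m in np.flatnonzero(A[:, n]):   # neighbors m of node n
        agg[:, n] += H[:, m] / np.sqrt(deg[n] * deg[m])

# Matrix form: entry (m, n) of D^{-1/2} A D^{-1/2} is A[m,n]/sqrt(deg_m deg_n)
D_inv_sqrt = np.diag(deg ** -0.5)
agg_matrix = H @ D_inv_sqrt @ A @ D_inv_sqrt

assert np.allclose(agg, agg_matrix)
```

For node 0, for example, both forms give h_1/√(1·2) = 4/√2 ≈ 2.83, so a neighbor with many connections contributes proportionally less.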
13.8.5 Max pooling aggregation
An alternative operation that is also invariant to permutation is computing the maximum
of a set of objects. The max pooling aggregation operator is:

agg[n] = max_{m∈ne[n]} [h_m],    (13.21)

where the operator max[•] returns the element-wise maximum of the vectors h_m that
are neighbors to the current node n.
13.8.6 Aggregation by attention
The aggregation methods discussed so far either weight the contribution of the neighbors
equally or in a way that depends on the graph topology. Conversely, in graph attention
layers, the weights depend on the data at the nodes. A linear transform is applied to
the current node embeddings so that:

H'_k = β_k 1^T + Ω_k H_k.    (13.22)
. (13.22)
Then the similarity s_mn of each transformed node embedding h'_m to the transformed
node embedding h'_n is computed by concatenating the pairs, taking a dot product with
a column vector ϕ_k of learned parameters, and applying an activation function:

s_mn = a[ϕ_k^T [h'_m; h'_n]].    (13.23)

Figure 13.12 Comparison of graph convolutional network, dot product attention,
and graph attention network. In each case, the mechanism maps N embeddings
of size D stored in a D×N matrix X to an output of the same size. a) The
graph convolutional network applies a linear transformation X' = ΩX to the
data matrix. It then computes a weighted sum of the transformed data, where
the weighting is based on the adjacency matrix. A bias β is added, and the result
is passed through an activation function. b) The outputs of the self-attention
mechanism are also weighted sums of the transformed inputs, but this time the
weights depend on the data itself via the attention matrix. c) The graph attention
network combines both of these mechanisms; the weights are both computed from
the data and based on the adjacency matrix.
These variables are stored in an N×N matrix S, where each element represents the
similarity of every node to every other. As in dot-product self-attention, the attention
weights contributing to each output embedding are normalized to be positive and sum
to one using the softmax operation. However, only those values corresponding to the
current node and its neighbors should contribute. The attention weights are applied to
the transformed embeddings:

H_{k+1} = a[H'_k · Softmask[S, A + I]],    (13.24)

where a[•] is a second activation function. The function Softmask[•, •] computes the
attention values by applying the softmax operation separately to each column of its first
argument S, but only after setting values where the second argument A + I is zero to
negative infinity, so they do not contribute. This ensures that the attention to non-
neighboring nodes is zero.
This is very similar to the self-attention computation in transformers (see figure 13.12
and notebook 13.4), except that (i) the keys, queries, and values are all the same, (ii) the
measure of similarity is different, and (iii) the attentions are masked so that each node
only attends to itself and its neighbors (see problem 13.10). As in transformers, this
system can be extended to use multiple heads that are run in parallel and recombined.
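Equations 13.22-13.24 can be sketched in numpy as follows. This is our own single-head illustration (names hypothetical; ReLU is used for both activations, and the O(N²) similarity loop is for clarity, not efficiency):

```python
import numpy as np

def softmask(S, mask):
    """Column-wise softmax of S, restricted to entries where mask is nonzero.

    Entries with mask == 0 are set to -inf before the softmax, so
    attention to non-neighboring nodes is exactly zero.
    """
    S = np.where(mask > 0, S, -np.inf)
    S = S - S.max(axis=0, keepdims=True)       # numerical stability
    E = np.exp(S)
    return E / E.sum(axis=0, keepdims=True)

def graph_attention_layer(H, A, Omega, beta, phi):
    """One graph attention layer following equations 13.22-13.24.

    H: D x N embeddings, A: N x N adjacency, Omega: D x D, beta: D x 1,
    phi: length-2D vector for the similarity computation.
    """
    N = A.shape[0]
    Hp = beta @ np.ones((1, N)) + Omega @ H    # transformed embeddings (13.22)
    S = np.empty((N, N))
    for m in range(N):
        for n in range(N):                     # s_mn from concatenated pair (13.23)
            pair = np.concatenate([Hp[:, m], Hp[:, n]])
            S[m, n] = max(phi @ pair, 0.0)
    attn = softmask(S, A + np.eye(N))          # masked column-wise softmax
    return np.maximum(Hp @ attn, 0.0)          # weighted combination (13.24)

# Toy usage on a chain graph 0-1-2
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
rng = np.random.default_rng(1)
H = rng.normal(size=(2, 3))
out = graph_attention_layer(H, A, np.eye(2), np.zeros((2, 1)), np.ones(4))
```

Each column of the softmask output sums to one, and the entries for non-neighbors (e.g., nodes 0 and 2 in the chain) are exactly zero.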
13.9 Edge graphs
Until now, we have focused on processing node embeddings. These evolve as they are
passed through the network so that by the end of the network, they represent both the
node and its context in the graph. We now consider the case where the information is
associated with the edges of the graph.

It is easy to adapt the machinery for node embeddings to process edge embeddings
using the edge graph (also known as the adjoint graph or line graph). This is a com-
plementary graph, in which each edge in the original graph becomes a node, and every
two edges with a common node in the original graph create an edge in the new graph
(figure 13.13). In general, a graph can be recovered from its edge graph, so it's possible
to swap between these two representations (see problems 13.11-13.13).

To process edge embeddings, the graph is translated to its edge graph. Then we
use exactly the same techniques, aggregating information at each new node from its
neighbors and combining this with the current representation. When both node and
edge embeddings are present, we can translate back and forth between the two graphs.
Now there are four possible updates (nodes update nodes, nodes update edges, edges
update nodes, and edges update edges), and these can be alternated as desired, or with
minor modifications, nodes can be updated simultaneously from both nodes and edges
(see problem 13.14).
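Constructing the edge graph is straightforward. A minimal sketch (our own function name, using an edge-list representation):

```python
from itertools import combinations

def edge_graph(edges):
    """Build the edge graph (line graph): each original edge becomes a
    node, and two new nodes are connected if their edges share a node.

    edges: list of (u, v) tuples for an undirected graph.
    Returns the new edges as pairs of indices into `edges`.
    """
    new_edges = []
    for (i, e1), (j, e2) in combinations(enumerate(edges), 2):
        if set(e1) & set(e2):          # the two edges share a node
            new_edges.append((i, j))
    return new_edges

# Triangle graph: three edges, each pair shares a node,
# so the edge graph is also a triangle
triangle = [(0, 1), (1, 2), (2, 0)]
print(edge_graph(triangle))            # [(0, 1), (0, 2), (1, 2)]
```

The triangle is the classic exception to recoverability: its edge graph is also a triangle, which is likewise the edge graph of the 3-pointed star, so in that one case the original graph cannot be recovered uniquely.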
Figure 13.13 Edge graph. a) Graph with six nodes. b) To create the edge graph,
we assign one node for each original edge (cyan circles), and c) connect the new
nodes if the edges they represent connect to the same node in the original graph.
13.10 Summary
Graphs consist of a set of nodes, where pairs of these nodes are connected by edges. Both
nodes and edges can have data attached, and these are referred to as node embeddings
and edge embeddings, respectively. Many real-world problems can be framed in terms of
graphs, where the goal is to establish a property of the entire graph, properties of each
node or edge, or the presence of additional edges in the graph.
Graph neural networks are deep learning models that are applied to graphs. Since the
node order in graphs is arbitrary, the layers of graph neural networks must be equivariant
to permutations of the node indices. Spatial-based convolutional networks are a family
of graph neural networks that aggregate information from the neighbors of a node and
then use this to update the node embeddings.
One challenge of processing graphs is that they often occur in the transductive setting,
where there is only one partially labeled graph rather than sets of training and test
graphs. This graph can be extremely large, which adds further challenges in terms of
training and has led to sampling and partitioning algorithms. The edge graph has a
node for every edge in the original graph. By converting to this representation, graph
neural networks can be used to update the edge embeddings.
Notes
Sanchez-Lengeling et al. (2021) and Daigavane et al. (2021) present good introductory articles
on graph processing using neural networks. Recent surveys of research in graph neural networks
can be found in articles by Zhou et al. (2020a), Wu et al. (2020c), and Veličković (2023), and
the books of Hamilton (2020) and Ma & Tang (2021). GraphEDM (Chami et al., 2020) unifies
many existing graph algorithms into a single framework. In this chapter, we have related graphs
to convolutional networks following Bruna et al. (2013), but there are also strong connections
with belief propagation (Dai et al., 2016) and graph isomorphism tests (Hamilton et al., 2017a).
Zhang et al. (2019c) provide a review focusing specifically on graph convolutional networks.
Bronstein et al. (2021) provide a general overview of geometric deep learning, including learning
on graphs. Loukas (2020) discusses what types of functions graph neural networks can learn.
Applications: Applications include graph classification (e.g., Zhang et al., 2018b), node clas-
sification (e.g., Kipf & Welling, 2017), edge prediction (e.g., Zhang & Chen, 2018), graph
clustering (e.g., Tsitsulin et al., 2020), and recommender systems (e.g., Wu et al., 2023). Methods
for node classification are reviewed by Xiao et al. (2022a), methods for graph classification by
Errica et al. (2019), and methods for edge prediction by Mutlu et al. (2020) and Kumar et al.
(2020a).
Graph neural networks: Graph neural networks were introduced by Gori et al. (2005) and
Scarselli et al. (2008), who formulated them as a generalization of recursive neural networks.
The latter model used the iterative update:

h_n ← f[x_n, x_{m∈ne[n]}, e_{e∈ne_e[n]}, h_{m∈ne[n]}, ϕ],    (13.25)

in which each node embedding h_n is updated from the initial embedding x_n, initial embed-
dings x_{m∈ne[n]} at the adjacent nodes, initial embeddings e_{e∈ne_e[n]} at the adjacent edges, and
adjacent node embeddings h_{m∈ne[n]}. For convergence, the function f[•, •, •, •, ϕ] must be a
contraction mapping (see figure 16.9). If we unroll this equation in time for K steps and allow
different parameters ϕ_k at each step k, then equation 13.25 becomes similar to the graph con-
volutional network. Subsequent work extended graph neural networks to use gated recurrent
units (Li et al., 2016b) and long short-term memory networks (Selsam et al., 2019).
Spectral methods: Bruna et al. (2013) applied the convolution operation in the Fourier domain. The Fourier basis vectors can be found by taking the eigendecomposition of the graph Laplacian matrix, L = D − A, where D is the degree matrix and A is the adjacency matrix. This has disadvantages: the filters are not localized, and the decomposition is prohibitively expensive for large graphs. Henaff et al. (2015) tackled the first problem by forcing the Fourier representation to be smooth (and hence the spatial domain to be localized). Defferrard et al. (2016) introduced ChebNet, which approximates the filters efficiently by using the recursive properties of Chebyshev polynomials. This both provides spatially localized filters and reduces the computation. Kipf & Welling (2017) simplified this further to construct filters that use only a 1-hop neighborhood, resulting in a formulation similar to the spatial methods described in this chapter and providing a bridge between spectral and spatial methods.
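The spectral construction can be sketched in a few lines of NumPy. This is an illustrative fragment (our own, not code from the cited papers) that builds the unnormalized graph Laplacian L = D − A for a small path graph and extracts its eigenvectors, which play the role of the graph's Fourier basis:

```python
import numpy as np

def graph_laplacian(A):
    """Unnormalized graph Laplacian L = D - A for an undirected graph."""
    return np.diag(A.sum(axis=1)) - A

# A 4-node path graph 1-2-3-4.
A = np.array([[0., 1., 0., 0.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [0., 0., 1., 0.]])
L = graph_laplacian(A)
# The eigenvectors of L act as the graph's Fourier basis; the eigenvalues play
# the role of frequencies (the smallest is zero for a connected graph).
frequencies, fourier_basis = np.linalg.eigh(L)
```

For large graphs this eigendecomposition is exactly the prohibitive cost mentioned above, which motivated the polynomial approximations of ChebNet.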
Spatial methods: Spectral methods are ultimately based on the Graph Laplacian, so if the graph changes, the model must be retrained. This problem spurred the development of spatial methods. Duvenaud et al. (2015) defined convolutions in the spatial domain, using a different weight matrix to combine the adjacent embeddings for each node degree. This has the disadvantage that it becomes impractical if some nodes have a very large number of connections. Diffusion convolutional neural networks (Atwood & Towsley, 2016) use powers of the normalized adjacency matrix to blend features across different scales, sum these, pointwise multiply by weights, and pass through an activation function to create the node embeddings. Gilmer et al. (2017) introduced message-passing neural networks, which defined convolutions on the graph as propagating messages from spatial neighbors. The "aggregate and combine" formulation of GraphSAGE (Hamilton et al., 2017a) fits into this framework.
This work is subject to a Creative Commons CC-BY-NC-ND license. (C) MIT Press.
Aggregate and combine: Graph convolutional networks (Kipf & Welling, 2017) take a weighted average of the neighbors and current node and then apply a linear mapping and ReLU. GraphSAGE (Hamilton et al., 2017a) applies a neural network layer to each neighbor, taking the elementwise maximum to aggregate. Chiang et al. (2019) propose diagonal enhancement, in which the previous embedding is weighted more than the neighbors. Kipf & Welling (2017) introduced Kipf normalization, which normalizes the sum of the neighboring embeddings based on the degrees of the current node and its neighbors (see equation 13.19).
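As a rough sketch of the aggregate-and-combine step with Kipf normalization (cf. equation 13.19), the following NumPy fragment normalizes the adjacency matrix, with added self-connections, by the square roots of the node degrees. The variable names and toy dimensions are our own:

```python
import numpy as np

def gcn_layer(H, A, Omega, beta):
    """One aggregate-and-combine step with Kipf normalization (cf. eq. 13.19).
    H: D x N node embeddings (one column per node), A: N x N adjacency."""
    A_tilde = A + np.eye(A.shape[0])                   # add self-connections
    inv_sqrt_deg = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    # Element (m, n) becomes A_tilde[m, n] / sqrt(d_m * d_n).
    A_norm = A_tilde * inv_sqrt_deg[:, None] * inv_sqrt_deg[None, :]
    return np.maximum(0.0, beta[:, None] + Omega @ H @ A_norm)   # ReLU

rng = np.random.default_rng(0)
A = np.array([[0., 1., 1.], [1., 0., 0.], [1., 0., 0.]])
H = rng.standard_normal((4, 3))                        # 4-dim embeddings, 3 nodes
H_next = gcn_layer(H, A, rng.standard_normal((4, 4)), rng.standard_normal(4))
```

The symmetric normalization prevents high-degree nodes from dominating the aggregated embedding.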
The mixture model network or MoNet (Monti et al., 2017) takes this one step further by learning a weighting based on the degrees of the current node and the neighbor. They associate a pseudo-coordinate system with each node, where the positions of the neighbors depend on these two quantities. They then learn a continuous function based on a mixture of Gaussians and sample this at the pseudo-coordinates of the neighbors to get the weights. In this way, they can learn the weightings for nodes and neighbors with arbitrary degrees. Pham et al. (2017) use a linear interpolation of the node embedding and neighbors with a different weighted combination for each dimension. The weight of this gating mechanism is generated as a function of the data.
Higher-order convolutional layers: Zhou & Li (2017) used higher-order convolutions by replacing the adjacency matrix A with Ã = Min[A^L + I, 1], where L is the maximum walk-length, 1 is a matrix containing only ones, and Min[•, •] takes the pointwise minimum of its two matrix arguments; the updates now sum together contributions from any nodes where there is at least one walk of length L. Abu-El-Haija et al. (2019) proposed MixHop, which computes node updates from the neighbors (using the adjacency matrix A), the neighbors of the neighbors (using A²), and so on. They concatenate these updates at each layer. Lee et al. (2018) combined information from nodes beyond the immediate neighbors using geometric motifs, which are small local geometric patterns in the graph (e.g., a fully connected clique of five nodes).
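The substitution Ã = Min[A^L + I, 1] can be computed directly. This hedged sketch (our own minimal implementation, not the authors' code) shows that for a path graph with L = 2, nodes two steps apart become connected:

```python
import numpy as np

def higher_order_adjacency(A, L):
    """A_tilde = Min[A^L + I, 1] (Zhou & Li, 2017): connects any pair of nodes
    joined by at least one walk of length L, plus self-connections."""
    walks = np.linalg.matrix_power(A, L) + np.eye(A.shape[0])
    return np.minimum(walks, 1.0)    # clip walk counts down to a 0/1 adjacency

# For a path graph 1-2-3-4 and L = 2, nodes two steps apart become connected.
A = np.array([[0., 1., 0., 0.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [0., 0., 1., 0.]])
A_tilde = higher_order_adjacency(A, 2)
```

Note that entry (m, n) of A^L counts walks of length exactly L, so clipping at one converts the walk count into a binary "reachable in L steps" relation.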
Residual connections: Kipf & Welling (2017) proposed a residual connection in which the original embeddings are added to the updated ones. Hamilton et al. (2017b) concatenate the previous embedding to the output of the next layer (see equation 13.16). Rossi et al. (2020) present an inception-style network where the node embedding is concatenated to not only the aggregation of its neighbors but also the aggregation of all neighbors within a walk of two (via computing powers of the adjacency matrix). Xu et al. (2018) introduced jump knowledge connections in which the final output at each node consists of the concatenated node embeddings throughout the network. Zhang & Meng (2019) present a general formulation of residual embeddings called GResNet and investigate several variations in which the embeddings from the previous layer are added, the input embeddings are added, or versions of these that aggregate information from their neighbors (without further transformation) are added.
Attention in graph neural networks: Veličković et al. (2019) developed the graph attention network (figure 13.12c). Their formulation uses multiple heads whose outputs are combined symmetrically. Gated Attention Networks (Zhang et al., 2018a) weight the output of the different heads in a way that depends on the data itself. Graph-BERT (Zhang et al., 2020) performs node classification using self-attention alone; the graph's structure is captured by adding position embeddings to the data, similarly to how the absolute or relative position of words is captured in the transformer (chapter 12). For example, they add positional information that depends on the number of hops between nodes in the graph.
Permutation invariance: In DeepSets, Zaheer et al. (2017) presented a general permutation
invariant operator for processing sets. Janossy pooling (Murphy et al., 2018) accepts that many
functions are not permutation equivariant and instead uses a permutation-sensitive function
and averages the results across many permutations.
Draft: please send errata to udlbookmail@gmail.com.
Edge graphs: The notion of the edge graph, line graph, or adjoint graph dates to Whitney (1932). The idea of "weaving" layers that update node embeddings from node embeddings, node embeddings from edge embeddings, edge embeddings from edge embeddings, and edge embeddings from node embeddings was proposed by Kearnes et al. (2016). However, here the node-node and edge-edge updates do not involve the neighbors. Monti et al. (2018) introduced the dual-primal graph CNN, a modern formulation in a CNN framework that alternates between updates in the original and edge graphs.
Power of graph neural networks: Xu et al. (2019) argue that a neural network should be able to distinguish different graph structures; it is undesirable to map two graphs to the same output if they have the same initial node embeddings but different adjacency matrices. They identified graph structures that could not be distinguished by previous approaches such as GCNs (Kipf & Welling, 2017) and GraphSAGE (Hamilton et al., 2017a). They developed a more powerful architecture with the same discriminative power as the Weisfeiler-Lehman graph isomorphism test (Weisfeiler & Leman, 1968), which is known to discriminate a broad class of graphs. The resulting graph isomorphism network was based on the aggregation operation:

$$\mathbf{h}_{k+1}^{(n)} = \mathrm{mlp}\Bigl[(1+\epsilon_k)\,\mathbf{h}_k^{(n)} + \sum_{m\in\mathrm{ne}[n]} \mathbf{h}_k^{(m)}\Bigr]. \qquad (13.26)$$
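The aggregation in equation 13.26 can be vectorized over all nodes at once, since post-multiplying the embedding matrix by the adjacency matrix sums each node's neighbors. The sketch below is a minimal illustration; the two-layer MLP and its parameter shapes are hypothetical choices of ours:

```python
import numpy as np

def gin_aggregate(H, A, eps):
    """Sum aggregation of equation 13.26, vectorized over all nodes.
    H is a D x N embedding matrix, A the N x N adjacency matrix; column n
    of H @ A is the sum of the embeddings of node n's neighbors."""
    return (1.0 + eps) * H + H @ A

def gin_layer(H, A, eps, W1, b1, W2, b2):
    """Aggregate, then apply a two-layer MLP (a hypothetical minimal choice)."""
    Z = gin_aggregate(H, A, eps)
    return W2 @ np.maximum(0.0, W1 @ Z + b1[:, None]) + b2[:, None]

# Toy graph: node 0 is connected to nodes 1 and 2.
A = np.array([[0., 1., 1.], [1., 0., 0.], [1., 0., 0.]])
H = np.ones((2, 3))
Z = gin_aggregate(H, A, eps=0.5)   # node 0 gets 1.5 + 2 = 3.5 in each dimension
```

Sum aggregation (rather than mean or max) is what lets this layer distinguish neighborhoods that differ only in their multiplicities.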
Batches: The original paper on graph convolutional networks (Kipf & Welling, 2017) used full-batch gradient descent. This has memory requirements proportional to the number of nodes, embedding size, and number of layers during training. Since then, three types of methods have been proposed to reduce the memory requirements and create batches for SGD in the transductive setting: node sampling, layer sampling, and sub-graph sampling.
Node sampling methods start by randomly selecting a subset of target nodes and then work back through the network, adding a subset of the nodes in the receptive field at each stage. GraphSAGE (Hamilton et al., 2017a) proposed a fixed number of neighborhood samples as in figure 13.10b. Chen et al. (2018b) introduce a variance reduction technique, but this uses historical activations of nodes and so still has a high memory requirement. PinSAGE (Ying et al., 2018a) uses random walks from the target nodes and chooses the K nodes with the highest visit count. This prioritizes ancestors that are more closely connected.
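A minimal sketch of fixed-size neighbor sampling in the spirit of GraphSAGE follows; the function name and interface are our own, not the original implementation:

```python
import numpy as np

def sample_neighbors(A, targets, num_samples, rng):
    """Fixed-size neighborhood sampling in the spirit of GraphSAGE (figure
    13.10b): for each target node, keep at most num_samples random neighbors."""
    sampled = {}
    for n in targets:
        neighbors = np.flatnonzero(A[n])
        if len(neighbors) > num_samples:
            neighbors = rng.choice(neighbors, size=num_samples, replace=False)
        sampled[n] = neighbors
    return sampled

rng = np.random.default_rng(0)
A = np.ones((6, 6)) - np.eye(6)    # fully connected graph on 6 nodes
sampled = sample_neighbors(A, targets=[0, 1], num_samples=3, rng=rng)
```

Applying this recursively, layer by layer back from the target nodes, bounds the receptive field of each minibatch.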
Node sampling still requires increasing numbers of nodes as we pass back through the graph. Layer sampling methods address this by directly sampling the receptive field in each layer independently. Examples of layer sampling include FastGCN (Chen et al., 2018a), adaptive sampling (Huang et al., 2018b), and layer-dependent importance sampling (Zou et al., 2019).
Subgraph sampling methods randomly draw subgraphs or divide the original graph into sub-graphs. These are then trained as independent data examples. Examples of these approaches include GraphSAINT (Zeng et al., 2020), which samples sub-graphs during training using random walks and then runs a full GCN on the subgraph while also correcting for the bias and variance of the minibatch. Cluster GCN (Chiang et al., 2019) partitions the graph into clusters (by maximizing the embedding utilization or number of within-batch edges) in a pre-processing stage and randomly selects clusters to form minibatches. To create more randomness, they train random subsets of these clusters plus the edges between them (see figure 13.11).

Wolfe et al. (2021) proposed a distributed training method that both partitions the graph and trains narrower GCNs in parallel by partitioning the feature space at different layers. More information about sampling graphs can be found in Rozemberczki et al. (2020).
Regularization and normalization: Rong et al. (2020) proposed DropEdge, which randomly drops edges from the graph during each training iteration by masking the adjacency matrix. This can be done for the whole neural network or differently in each layer (layer-wise DropEdge). In a sense, this is similar to dropout in that it breaks connections in the flow of data, but it can also be considered an augmentation method since changing the graph is similar to perturbing the data. Schlichtkrull et al. (2018), Teru et al. (2020), and Veličković et al. (2019) also proposed randomly dropping edges from the graph as a form of regularization similar to dropout. Node sampling methods (Hamilton et al., 2017a; Huang et al., 2018b; Chen et al., 2018a) can also be considered regularizers. Hasanzadeh et al. (2020) present a general framework called DropConnect that unifies many of the above approaches.
There are also many proposed normalization schemes for graph neural networks, including PairNorm (Zhao & Akoglu, 2020), weight normalization (Oono & Suzuki, 2019), differentiable group normalization (Zhou et al., 2020b), and GraphNorm (Cai et al., 2021).
Multi-relational graphs: Schlichtkrull et al. (2018) proposed a variation of graph convolutional networks for multi-relational graphs (i.e., graphs with more than one edge type). Their scheme separately aggregates information from each edge type using different parameters. If there are many edge types, the number of parameters may become large, and to combat this, they propose that each edge type uses a different weighting of a basis set of parameters.
Hierarchical representations and pooling: CNNs for image classification gradually decrease the representation size but increase the number of channels as the network progresses. However, the GCNs for graph classification in this chapter maintain the entire graph until the last layer and then combine all the nodes to compute the final prediction. Ying et al. (2018b) proposed DiffPool, which clusters graph nodes to make a graph that gets progressively smaller as the depth increases in a way that is differentiable, and so can be learned. This can be done based on the graph structure alone or adaptively based on the graph structure and the embeddings. Other pooling methods include SortPool (Zhang et al., 2018b) and self-attention graph pooling (Lee et al., 2019). A comparison of pooling layers for graph neural networks can be found in Grattarola et al. (2022). Gao & Ji (2019) propose an encoder-decoder structure for graphs based on the U-Net (see figure 11.10).
Geometric graphs: The MoNet model (Monti et al., 2017) can exploit geometric information because neighboring nodes have well-defined spatial positions. They learn a mixture of Gaussians function and sample from this based on the relative coordinates of the neighbor. In this way, they can weight neighboring nodes based on their relative positions as in standard convolutional neural networks, even though these positions are not constant. The geodesic CNN (Masci et al., 2015) and anisotropic CNN (Boscaini et al., 2016) both adapt convolution to manifolds (i.e., surfaces) as represented by triangular meshes. They locally approximate the surface as a plane and define a coordinate system on this plane around the current node.
Oversmoothing and suspended animation: Unlike other deep learning models, graph neural networks did not, until recently, benefit significantly from increasing depth. Indeed, the original GCN paper (Kipf & Welling, 2017) and GraphSAGE (Hamilton et al., 2017a) both only use two layers, and Chiang et al. (2019) trained a five-layer Cluster-GCN to get state-of-the-art performance on the PPI dataset. One possible explanation is over-smoothing (Li et al., 2018c); at each layer, the network incorporates information from a larger neighborhood, and it may be that this ultimately results in the dissolution of (important) local information. Indeed, Xu et al. (2018) prove that the influence of one node on another is proportional to the probability of reaching that node in a K-step random walk. This approaches the stationary distribution of walks over the graph with increasing K, causing the local neighborhood to be washed out.

Alon & Yahav (2021) proposed another explanation for why performance doesn't improve with network depth. They argue that adding depth allows information to be aggregated from longer paths. However, in practice, the exponential growth in the number of neighbors means there is a bottleneck whereby too much information is "squashed" into the fixed-size node embeddings.
Ying et al. (2018a) also note that when the depth of the network exceeds a certain limit, the gradients no longer propagate back, and learning fails for both the training and test data. They term this effect suspended animation. This is similar to when many layers are naïvely added to convolutional neural networks (figure 11.2). They propose a family of residual connections that allow deeper networks to be trained. Vanishing gradients (section 7.5) have also been identified as a limitation by Li et al. (2021b).

It has recently become possible to train deeper graph neural networks using various forms of residual connection (Xu et al., 2018; Li et al., 2020a; Gong et al., 2020; Chen et al., 2020b; Xu et al., 2021a). Li et al. (2021a) train a state-of-the-art model with more than 1000 layers using an invertible network to reduce the memory requirements of training (see chapter 16).

Figure 13.14 Graphs for problems 13.1, 13.3, and 13.8.
Problems
Problem 13.1 Write out the adjacency matrices for the two graphs in figure 13.14.
Problem 13.2 Draw graphs that correspond to the following adjacency matrices:

$$\mathbf{A}_1 = \begin{bmatrix}
0 & 1 & 1 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 1 & 1 & 1 & 0 \\
1 & 0 & 0 & 0 & 0 & 1 & 1 \\
0 & 1 & 0 & 0 & 0 & 1 & 1 \\
0 & 1 & 0 & 0 & 0 & 0 & 1 \\
0 & 1 & 1 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 1 & 1 & 0 & 0
\end{bmatrix}
\quad\text{and}\quad
\mathbf{A}_2 = \begin{bmatrix}
0 & 0 & 1 & 1 & 0 & 0 & 1 \\
0 & 0 & 1 & 1 & 1 & 0 & 0 \\
1 & 1 & 0 & 0 & 0 & 0 & 0 \\
1 & 1 & 0 & 0 & 1 & 1 & 1 \\
0 & 1 & 0 & 1 & 0 & 0 & 1 \\
0 & 0 & 0 & 1 & 0 & 0 & 1 \\
1 & 0 & 0 & 1 & 1 & 1 & 0
\end{bmatrix}.$$
Problem 13.3 Consider the two graphs in figure 13.14. How many ways are there to walk from node one to node two in (i) three steps and (ii) seven steps?
Problem 13.4 The diagonal of A² in figure 13.4c contains the number of edges that connect to each corresponding node. Explain this phenomenon.
Problem 13.5 What permutation matrix is responsible for the transformation between the graphs in figures 13.5a–c and figures 13.5d–f? (See appendix B.4.4, Permutation matrix.)
Problem 13.6 Prove that:

$$\mathrm{sig}\bigl[\boldsymbol{\beta}_K + \boldsymbol{\Omega}_K\mathbf{H}_K\mathbf{1}\bigr] = \mathrm{sig}\bigl[\boldsymbol{\beta}_K + \boldsymbol{\Omega}_K\mathbf{H}_K\mathbf{P}\mathbf{1}\bigr], \qquad (13.27)$$
where P is an N × N permutation matrix (a matrix that is all zeros except for exactly one entry in each row and each column, which is one), and 1 is an N × 1 vector of ones.

Figure 13.15 Graphs for problems 13.11–13.13.
Problem 13.7 Consider the simple GNN layer:

$$\mathbf{H}_{k+1} = \mathrm{GraphLayer}[\mathbf{H}_k, \mathbf{A}] = \mathbf{a}\Bigl[\boldsymbol{\beta}_k\mathbf{1}^T + \boldsymbol{\Omega}_k\bigl[\mathbf{H}_k + \mathbf{H}_k\mathbf{A}\bigr]\Bigr], \qquad (13.28)$$

where H is a D × N matrix containing the N node embeddings in its columns, A is the N × N adjacency matrix, β is the bias vector, and Ω is the weight matrix. Show that this layer is equivariant to permutations of the node order so that:

$$\mathrm{GraphLayer}[\mathbf{H}_k, \mathbf{A}]\,\mathbf{P} = \mathrm{GraphLayer}[\mathbf{H}_k\mathbf{P},\; \mathbf{P}^T\mathbf{A}\mathbf{P}], \qquad (13.29)$$

where P is an N × N permutation matrix.
Problem 13.8 What is the degree matrix D for each graph in figure 13.14?
Problem 13.9 The authors of GraphSAGE (Hamilton et al., 2017a) propose a pooling method in which the node embedding is averaged together with its neighbors so that:

$$\mathrm{agg}[n] = \frac{1}{1 + |\mathrm{ne}[n]|}\Bigl(\mathbf{h}_n + \sum_{m\in\mathrm{ne}[n]}\mathbf{h}_m\Bigr). \qquad (13.30)$$

Show how this operation can be computed simultaneously for all node embeddings in the D × N embedding matrix H using linear algebra. You will need to use both the adjacency matrix A and the degree matrix D.
Problem 13.10 Devise a graph attention mechanism based on dot-product self-attention and draw its mechanism in the style of figure 13.12.
Problem 13.11 Draw the edge graph associated with the graph in figure 13.15a.
Problem 13.12 Draw the node graph corresponding to the edge graph in figure 13.15b.
Problem 13.13 For a general undirected graph, describe how the adjacency matrix of the node
graph relates to the adjacency matrix of the corresponding edge graph.
Problem 13.14 Design a layer that updates a node embedding h_n based on its neighboring node embeddings {h_m}_{m∈ne[n]} and neighboring edge embeddings {e_m}_{m∈nee[n]}. You should consider the possibility that the edge embeddings are not the same size as the node embeddings.
Chapter 14
Unsupervised learning
Chapters 2–9 walked through the supervised learning pipeline. We defined models that mapped observed data x to output values y and introduced loss functions that measured the quality of that mapping for a training dataset {x_i, y_i}. Then we discussed how to fit and measure the performance of these models. Chapters 10–13 introduced more sophisticated model architectures incorporating parameter sharing and allowing parallel computational paths.
The dening characteristic of unsupervised learning models is that they are learned
from a set of observed data {x
i
} in the absence of labels. All unsupervised models share
this property, but they have diverse goals. They may be used to generate plausible new
samples from the dataset or to manipulate, denoise, interpolate between, or compress
examples. They can also be used to reveal the internal structure of a dataset (e.g., by
dividing it into coherent clusters) or to distinguish whether new examples belong to the
same dataset or are outliers.
This chapter introduces a taxonomy of unsupervised learning models and then discusses the desirable properties of models and how to measure their performance. The four subsequent chapters discuss four particular models: generative adversarial networks (GANs), variational autoencoders (VAEs), normalizing flows, and diffusion models.¹

¹Until this point, almost all of the relevant math has been embedded in the text. However, the following four chapters require a solid knowledge of probability. Appendix C covers the relevant material.

14.1 Taxonomy of unsupervised learning models

A common strategy in unsupervised learning is to define a mapping between the data examples x and a set of unseen latent variables z. These latent variables capture underlying structure in the dataset and usually have a lower dimension than the original data; in this sense, a latent variable z can be considered a compressed version of a data example x that captures its essential qualities (figures 1.9–1.10).

In principle, the mapping between the observed and latent variables can be in either direction. Some models map from the data x to latent variables z. For example, the famous k-means algorithm maps the data x to a cluster assignment z ∈ {1, 2, . . . , K}. Other models map from the latent variables z to the data x. Consider defining a distribution Pr(z) over the latent variable z in these models. New examples can now be generated by (i) drawing from this distribution and (ii) mapping the sample to the data space x. Accordingly, these are termed generative models (see figure 14.1).

Figure 14.1 Taxonomy of unsupervised learning models. Unsupervised learning refers to any model trained on datasets without labels. Generative models can synthesize (generate) new examples with similar statistics to the training data. A subset of these are probabilistic and define a distribution over the data. We draw samples from this distribution to generate new examples. Latent variable models define a mapping between an underlying explanatory (latent) variable and the data. They may fall into any of the above categories.
The four models in chapters 15 to 18 are all generative models that use latent variables. Generative adversarial networks (chapter 15) learn to generate data examples x from latent variables z, using a loss that encourages the generated samples to be indistinguishable from real examples (figure 14.2a).
Normalizing ows, variational autoencoders, and diusion models (chapters 16–18)
are probabilistic generative models. In addition to generating new examples, they assign a
probability P r(x|ϕ) to each data point x. This will depend on the model parameters ϕ,
and in training, we maximize the probability of the observed data {x
i
}, so the loss is
the sum of the negative log-likelihoods (gure 14.2b):
L[ϕ] =
I
X
i=1
log
h
P r(x
i
|ϕ)
i
. (14.1)
Since probability distributions must sum to one, this implicitly reduces the probability
of examples that lie far from the observed data. As well as providing a training criterion,
assigning probabilities is useful in its own right; the probability on a test set can be
used to compare two models quantitatively, and the probability for an example can be
thresholded to determine if it belongs to the same dataset or is an outlier.
2
14.2 What makes a good generative model?
Generative models based on latent variables should have the following properties:
2
Note that not all probabilistic generative models rely on latent variables. The transformer decoder
(section 12.7) was learned without labels, can generate new examples, and can assign a probability to
these examples but is based on an autoregressive formulation (equation 12.15).
Figure 14.2 Fitting generative models. a) Generative adversarial models provide a mechanism for generating samples (orange points). As training proceeds (left to right), the loss function encourages these samples to become progressively less distinguishable from real examples (cyan points). b) Probabilistic models (including variational autoencoders, normalizing flows, and diffusion models) learn a probability distribution over the training data. As training proceeds (left to right), the likelihood of the real examples increases under this distribution, which can be used to draw new samples and assess the probability of new data points.
Ecient sampling: Generating samples from the model should be computation-
ally inexpensive and take advantage of the parallelism of modern hardware.
High-quality sampling: The samples should be indistinguishable from the real
data with which the model was trained.
Coverage: Samples should represent the entire training distribution. It is insuf-
cient to generate samples that all look like a subset of the training examples.
Well-behaved latent space: Every latent variable z corresponds to a plausible
data example x. Smooth changes in z correspond to smooth changes in x.
Disentangled latent space: Manipulating each dimension of z should correspond
to changing an interpretable property of the data. For example, in a model of
language, it might change the topic, tense, or verbosity.
Ecient likelihood computation: If the model is probabilistic, we would like
to be able to calculate the probability of new examples eciently and accurately.
This naturally leads to the question of whether the generative models that we consider satisfy these properties. The answer is subjective, but figure 14.3 provides guidance. The precise assignments are disputable, but most practitioners would agree that there is no single model that satisfies all of these characteristics.
Model Ecient Sample Coverage Well-behaved Disentangled Ecient
quality latent space latent space likelihood
GANs 7 ? n/a
VAEs 7 ? ? 7
Flows 7 ? ?
Diusion 7 ? 7 7 7
Figure 14.3 Properties of four generative models. Neither generative adversarial
networks (GANs), variational autoencoders (VAEs), normalizing ows (Flows),
nor diusion models (diusion) have the full complement of desirable properties.
14.3 Quantifying performance
The previous section discussed the desirable properties of generative models. We now
consider quantitative measures of success for generative models. Much experimentation
with generative models has used images due to the widespread availability of that data
and the ease of qualitatively judging the samples. Consequently, some of these metrics
only apply to images.
Test likelihood: One way to compare probabilistic models is to measure their likelihood for a test dataset. It is ineffective to measure the training data likelihood because a model could assign a very high probability to each training point and very low probabilities in between. This model would have a very high training likelihood but could only reproduce the training data. The test likelihood captures how well the model generalizes from the training data and also the coverage; if the model assigns a high probability to just a subset of the training data, it must assign lower probabilities elsewhere, so a portion of the test examples will have low probability.
Test likelihood is a sensible way to quantify probabilistic models, but unfortunately, it is not relevant for generative adversarial models (which do not assign a probability) and is expensive to estimate for variational autoencoders and diffusion models (although it is possible to compute a lower bound on the log-likelihood). Normalizing flows are the only type of model for which the likelihood can be computed exactly and efficiently.
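As a toy illustration of likelihood-based comparison (equation 14.1 evaluated on held-out data), the following sketch scores a one-dimensional Gaussian model on a small test set; the model choice and numbers are invented purely for illustration:

```python
import numpy as np

def negative_log_likelihood(x_test, mu, sigma):
    """Equation 14.1 for a univariate Gaussian model Pr(x|phi), phi = (mu, sigma),
    evaluated on held-out test data."""
    log_probs = (-0.5 * np.log(2 * np.pi * sigma**2)
                 - (x_test - mu)**2 / (2 * sigma**2))
    return -np.sum(log_probs)

x_test = np.array([0.9, 1.0, 1.1])    # held-out examples
good_model = negative_log_likelihood(x_test, mu=1.0, sigma=1.0)
bad_model = negative_log_likelihood(x_test, mu=5.0, sigma=1.0)
# The model that generalizes to the test data attains the lower (better) loss.
```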
Inception score: The inception score (IS) is specialized for images and ideally for generative models trained on the ImageNet database. The score is calculated using a pretrained classification model, usually the "Inception" model, from which the name is derived. It is based on two criteria. First, each generated image x_i should look like one and only one of the 1000 possible classes y in the ImageNet database. Hence, the probability distribution Pr(y_i|x_i) should be highly peaked at the correct class. Second, the entire set of generated images should be assigned to the classes with equal probability, so Pr(y) should be flat when averaged over all generated examples.
Figure 14.4 Inception score. a) A pretrained network classifies the generated images. If the images are realistic, the resulting class probabilities Pr(y_i|x_i) should be peaked at the correct class. b) If the model generates all classes equally frequently, the marginal (average) class probabilities should be flat. The inception score measures the average distance between the distributions in (a) and the distribution in (b). Images from Deng et al. (2009).
The inception score measures the average distance between these two distributions over the generated set. This distance will be large if one is peaked and the other flat (figure 14.4). More precisely, it returns the exponential of the expected KL-divergence (see appendix C.5.1) between Pr(y_i|x_i) and Pr(y):

$$\mathrm{IS} = \exp\Biggl[\frac{1}{I}\sum_{i=1}^{I} D_{KL}\Bigl[Pr(y_i|\mathbf{x}_i)\,\Big\|\,Pr(y)\Bigr]\Biggr], \qquad (14.2)$$

where I is the number of generated examples and:

$$Pr(y) = \frac{1}{I}\sum_{i=1}^{I} Pr(y_i|\mathbf{x}_i). \qquad (14.3)$$
This metric is only sensible for generative models of the ImageNet database and is sensitive to the particular classification model; retraining this model can give quite different numerical results. Moreover, it does not reward diversity within an object class; it returns a high value if the model only generates one realistic example of each class.
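Equations 14.2–14.3 can be computed directly from a matrix of classifier outputs. This toy sketch (with two classes rather than 1000, and made-up probabilities) illustrates why peaked per-image distributions with a flat marginal give a high score, while a collapsed model scores close to one:

```python
import numpy as np

def inception_score(class_probs, eps=1e-12):
    """Equations 14.2-14.3: class_probs is an I x C matrix whose rows are
    Pr(y_i | x_i) from a pretrained classifier applied to I generated images."""
    marginal = class_probs.mean(axis=0)                   # Pr(y), equation 14.3
    kl = np.sum(class_probs * (np.log(class_probs + eps)
                               - np.log(marginal + eps)), axis=1)
    return np.exp(kl.mean())                              # equation 14.2

# Peaked per-image distributions covering both classes -> high score.
diverse = np.array([[0.99, 0.01], [0.01, 0.99]])
# All probability on one class -> every row equals the marginal, score ~ 1.
collapsed = np.array([[0.99, 0.01], [0.99, 0.01]])
```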
Fréchet inception distance: This measure is also intended for images and computes a symmetric distance between the distributions of generated samples and real examples. This must be approximate since it is hard to characterize either distribution (indeed, characterizing the distribution of real examples is the job of generative models in the first place). Hence, the Fréchet inception distance approximates both distributions by multivariate Gaussians and (as the name suggests) estimates the distance between them using the Fréchet distance (see appendix C.5.4).

However, it does not model the distance with respect to the original data but rather the activations in the deepest layer of the inception classification network. These hidden units are the ones most associated with object classes, so the comparison occurs at a semantic level, ignoring the more fine-grained details of the images. This metric does take account of diversity within classes but relies heavily on the information retained by the features in the inception network; any information discarded by the network does not contribute to the result. Some of this discarded information may still be important to generate realistic samples.
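A minimal sketch of the Fréchet distance between two fitted Gaussians follows. Here we use random features as stand-ins for inception activations, and exploit the fact that for symmetric positive-definite covariances the trace of (Σ₁Σ₂)^(1/2) equals the sum of the square roots of the eigenvalues of Σ₁Σ₂:

```python
import numpy as np

def frechet_distance(mu1, cov1, mu2, cov2):
    """Frechet distance between two Gaussians:
    ||mu1 - mu2||^2 + Tr(cov1 + cov2 - 2 (cov1 cov2)^(1/2))."""
    eigvals = np.linalg.eigvals(cov1 @ cov2)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return np.sum((mu1 - mu2) ** 2) + np.trace(cov1 + cov2) - 2.0 * trace_sqrt

# Features of "real" and "generated" samples (stand-ins for inception
# activations of the deepest layer).
rng = np.random.default_rng(0)
real = rng.standard_normal((500, 4))
fake = rng.standard_normal((500, 4)) + 2.0    # shifted distribution
fid = frechet_distance(real.mean(0), np.cov(real.T),
                       fake.mean(0), np.cov(fake.T))
```

In practice the features come from a pretrained classifier rather than raw data, as described above.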
Manifold precision/recall: Fréchet inception distance is sensitive both to the realism of the samples and their diversity but does not distinguish between these factors. To disentangle these qualities, we consider the overlap between the data manifold (i.e., the subset of the data space where the real examples lie) and the model manifold (i.e., where the generated samples lie). The precision is the fraction of model samples that fall into the data manifold. This measures the proportion of generated samples that are realistic. The recall is the fraction of data examples that fall within the model manifold. This measures the proportion of the real data the model can generate (figure 14.5).

To estimate the manifold, we place a hypersphere around each data example, whose radius is the distance to the kth nearest neighbor. The union of these spheres is an approximation of the manifold, and it's easy to determine if a new point lies within it. This manifold is also typically computed in the feature space of a classifier with the advantages and disadvantages that entails.
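A simplified sketch of the hypersphere construction (our own minimal version of the procedure in Kynkäänniemi et al., 2019): a collapsed generator that covers only a small region of the data distribution obtains high precision but low recall:

```python
import numpy as np

def in_manifold(points, reference, k=3):
    """Test whether each point lies inside the manifold approximated by
    hyperspheres around reference points, with radius equal to the distance
    to the k-th nearest neighbor (index 0 of the sorted distances is self)."""
    dists = np.linalg.norm(reference[:, None, :] - reference[None, :, :], axis=2)
    radii = np.sort(dists, axis=1)[:, k]
    to_ref = np.linalg.norm(points[:, None, :] - reference[None, :, :], axis=2)
    return (to_ref <= radii[None, :]).any(axis=1)

rng = np.random.default_rng(0)
real = rng.standard_normal((200, 2))
fake = rng.standard_normal((200, 2)) * 0.1    # covers only a small region
precision = in_manifold(fake, real).mean()    # how realistic are the samples?
recall = in_manifold(real, fake).mean()       # how much real data is covered?
```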
14.4 Summary
Unsupervised models learn about the structure of a dataset in the absence of labels. A
subset of these models is generative and can synthesize new data examples. A further
subset is probabilistic in that they can both generate new examples and assign a proba-
bility to observed data. The models considered in the following four chapters start with
a latent variable z which has a known distribution. A deep neural network then maps
from the latent variable to the observed data space. We considered desirable properties
of generative models and introduced metrics that attempt to quantify their performance.
Notes
Popular generative models include generative adversarial networks (Goodfellow et al., 2014),
variational autoencoders (Kingma & Welling, 2014), normalizing flows (Rezende & Mohamed, 2015), diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020), autoregressive models (Bengio et al., 2000; Van den Oord et al., 2016b), and energy-based models (LeCun et al., 2006). All except energy-based models are discussed in this book. Bond-Taylor et al. (2022) provide a recent survey of generative models.

Draft: please send errata to udlbookmail@gmail.com.

Figure 14.5 Manifold precision/recall. a) True distributions of real examples and samples synthesized by the generative model. b) The overlap can be summarized by the precision (the proportion of synthesized samples that overlap with the distribution or manifold of real examples), and c) recall (the proportion of real examples that overlap with the manifold of the synthesized samples). d) The manifold of synthesized samples can be approximated by taking the union of a set of hyperspheres centered on each sample. Here, these have constant radius, but more commonly, the radius is based on the distance to the k-th nearest neighbor. e) The manifold for real examples is approximated similarly. f) The precision can be computed as the proportion of samples that lie within the approximated manifold of real examples. Similarly, the recall is computed as the proportion of real examples that lie within the approximated manifold of samples (not shown). Adapted from Kynkäänniemi et al. (2019).
Evaluation: Salimans et al. (2016) introduced the inception score, and Heusel et al. (2017)
introduced the Fréchet inception distance, both of which are based on the Pool-3 layer of the
Inception V3 model (Szegedy et al., 2016). Nash et al. (2021) used earlier layers of the same
network that retain more spatial information to ensure that the spatial statistics of images are
also replicated. Kynkäänniemi et al. (2019) introduced the manifold precision/recall method.
Barratt & Sharma (2018) discuss the inception score in detail and point out its weaknesses.
Borji (2022) discusses the pros and cons of different methods for assessing generative models.
This work is subject to a Creative Commons CC-BY-NC-ND license. (C) MIT Press.
Chapter 15
Generative Adversarial Networks
A generative adversarial network or GAN is an unsupervised model that aims to generate
new samples that are indistinguishable from a set of training examples. GANs are just
mechanisms to create new samples; they do not build a probability distribution over the
modeled data and hence cannot evaluate the probability that a new data point belongs
to the same distribution.
In a GAN, the main generator network creates samples by mapping random noise to
the output data space. If a second discriminator network cannot distinguish between the
generated samples and the real examples, the samples must be plausible. If this network
can tell the difference, this provides a training signal that can be fed back to improve the
quality of the samples. This idea is simple, but training GANs is difficult: the learning
algorithm can be unstable, and although GANs may learn to generate realistic samples,
this does not imply that they learn to generate all possible samples.
GANs have been applied to many types of data, including audio, 3D models, text,
video, and graphs. However, they have found the most success in the image domain,
where they can produce samples that are almost indistinguishable from real pictures.
Accordingly, the examples in this chapter focus on synthesizing images.
15.1 Discrimination as a signal
We aim to generate new samples {x*_j} that are drawn from the same distribution as a set of real training data {x_i}. A single new sample x*_j is generated by (i) choosing a latent variable z_j from a simple base distribution (e.g., a standard normal) and then (ii) passing this data through a network x*_j = g[z_j, θ] with parameters θ. This network is known as the generator. During the learning process, the goal is to find parameters θ so that the samples {x*_j} look "similar" to the real data {x_i} (see figure 14.2a).
Similarity can be defined in many ways, but the GAN uses the principle that the samples should be statistically indistinguishable from the true data. To this end, a second network f[•, ϕ] with parameters ϕ, called the discriminator, is introduced. This network aims to classify its input as being a real example or a generated sample. If this
Figure 15.1 GAN mechanism. a) Given a parameterized function (a generator)
that synthesizes samples (orange arrows) and a batch of real examples (cyan
arrows), we train a discriminator to distinguish the real examples from the gen-
erated samples (sigmoid curve indicates the estimated probability that the data
point is real). b) The generator is trained by modifying its parameters so that the
discriminator becomes less confident the samples were synthetic (in this case, by
moving the orange samples to the right). The discriminator is then updated. c)
Alternating updates to the generator and discriminator cause the generated sam-
ples to become indistinguishable from real examples and the impetus to change
the generator (i.e., the slope of the sigmoid function) to diminish.
proves impossible, the generated samples are indistinguishable from the real examples,
and we have succeeded. If it is possible, the discriminator provides a signal that can be
used to improve the generation process.
Figure 15.1 illustrates this scheme. We start with a training set {x_i} of real 1D examples. A different batch of ten of these examples {x_i}_{i=1}^{10} is shown in each panel (cyan arrows). To create a batch of samples {x*_j}, we use the simple generator:

x^*_j = g[z_j, \theta] = z_j + \theta,   (15.1)

where latent variables {z_j} are drawn from a standard normal distribution, and the parameter θ translates the generated samples along the x-axis (figure 15.1).
At initialization, θ = −3.0, and the generated samples (orange arrows) lie to the left of
the real examples (cyan arrows). The discriminator is trained to distinguish the generated
samples from the real examples (the sigmoid curve indicates the probability that a data
point is real). During training, the generator parameters θ are manipulated to increase
the probability that its samples are classied as real. Here, this means increasing θ so
that the samples move rightwards where the sigmoid curve is higher.
We alternate between updating the discriminator and the generator. Figures 15.1b–c show two iterations of this process. It gradually becomes harder to classify the data, so the impetus to change θ becomes weaker (i.e., the sigmoid becomes flatter). At the end of the process, there is no way to distinguish the two sets of data; the discriminator, which now has chance performance, is discarded, and we are left with a generator that makes plausible samples (see notebook 15.1).
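This toy example can be reproduced with hand-derived gradients. The sketch below uses choices of our own that the chapter does not specify: a linear discriminator f[x, ϕ] = ϕ0 + ϕ1·x, fixed batches, and plain gradient descent on the two loss functions of equation 15.5.

```python
import math

def sig(a):
    return 1.0 / (1.0 + math.exp(-a))

# Real 1D examples (clustered near 3.0) and fixed latent draws (mean 0.0).
real = [2.5, 2.8, 3.0, 3.2, 3.5]
latents = [-1.0, -0.5, 0.0, 0.5, 1.0]

theta = -3.0            # generator parameter: x* = z + theta
phi0, phi1 = 0.0, 0.0   # linear discriminator: f[x] = phi0 + phi1 * x

for step in range(100):
    gen = [z + theta for z in latents]
    # Discriminator step: gradient of the first loss in equation 15.5.
    g0 = sum(sig(phi0 + phi1 * x) for x in gen) \
       - sum(1.0 - sig(phi0 + phi1 * x) for x in real)
    g1 = sum(sig(phi0 + phi1 * x) * x for x in gen) \
       - sum((1.0 - sig(phi0 + phi1 * x)) * x for x in real)
    phi0 -= 0.02 * g0
    phi1 -= 0.02 * g1
    # Generator step: gradient of the second loss in equation 15.5,
    # using d/d(theta) log[1 - sig[f]] = -sig[f] * phi1.
    g_theta = -phi1 * sum(sig(phi0 + phi1 * (z + theta)) for z in latents)
    theta -= 0.02 * g_theta
```

After these first updates, ϕ1 is positive (the discriminator scores points further to the right as more likely real), and θ has begun moving rightward toward the real data. Running the loop much longer exposes the delicate discriminator/generator balance discussed in section 15.2.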
15.1.1 GAN loss function
We now dene the loss function for training GANs more precisely. The discrimina-
tor f[x, ϕ] takes input x, has parameters ϕ, and returns a scalar that is higher when it
believes the input is a real example. This is a binary classication task, so we adapt the
binary cross-entropy loss function (section 5.4), which originally had the form:
\hat{\phi} = \underset{\phi}{\mathrm{argmin}}\left[\sum_i -(1-y_i)\log\Big[1-\mathrm{sig}\big[f[x_i,\phi]\big]\Big] - y_i\log\Big[\mathrm{sig}\big[f[x_i,\phi]\big]\Big]\right],   (15.2)
where y_i ∈ {0, 1} is the label, and sig[•] is the logistic sigmoid function (figure 5.7). In this case, we assume that the real examples x_i have label y = 1 and the generated samples x*_j have label y = 0 so that:
\hat{\phi} = \underset{\phi}{\mathrm{argmin}}\left[-\sum_j \log\Big[1-\mathrm{sig}\big[f[x^*_j,\phi]\big]\Big] - \sum_i \log\Big[\mathrm{sig}\big[f[x_i,\phi]\big]\Big]\right],   (15.3)
where i and j index the real examples and generated samples, respectively.
Now we substitute the definition of the generator x*_j = g[z_j, θ] and note that we must maximize with respect to θ since we want the generated samples to be misclassified (i.e., to have low likelihood of being synthetic, or high negative log-likelihood):
\hat{\theta} = \underset{\theta}{\mathrm{argmax}}\left[\underset{\phi}{\min}\left[-\sum_j \log\Big[1-\mathrm{sig}\big[f[g[z_j,\theta],\phi]\big]\Big] - \sum_i \log\Big[\mathrm{sig}\big[f[x_i,\phi]\big]\Big]\right]\right].   (15.4)
15.1.2 Training GANs
Equation 15.4 is a more complex loss function than we have seen before; the discrimi-
nator parameters ϕ are manipulated to minimize the loss function, and the generative
parameters θ are manipulated to maximize the loss function. GAN training is character-
ized as a minimax game; the generator tries to nd new ways to fool the discriminator,
which in turn searches for new ways to distinguish generated samples from real examples.
Technically, the solution is a Nash equilibrium: the optimization algorithm searches for a position that is simultaneously a minimum of one function and a maximum of the other. If training proceeds as planned, then upon convergence, g[z, θ] will be drawn from the same distribution as the data, and sig[f[•, ϕ]] will be at chance (i.e., 0.5).
To train the GAN, we can divide equation 15.4 into two loss functions:

L[\phi] = -\sum_j \log\Big[1-\mathrm{sig}\big[f[g[z_j,\theta],\phi]\big]\Big] - \sum_i \log\Big[\mathrm{sig}\big[f[x_i,\phi]\big]\Big]

L[\theta] = \sum_j \log\Big[1-\mathrm{sig}\big[f[g[z_j,\theta],\phi]\big]\Big],   (15.5)
Figure 15.2 GAN loss functions. A latent variable z_j is drawn from the base distribution and passed through the generator to create a sample x*_j. A batch {x*_j} of samples and a batch of real examples {x_i} are passed to the discriminator, which assigns a probability that each is real. The discriminator parameters ϕ are modified to assign high probability to the real examples and low probability to the generated samples. The generator parameters θ are modified to "fool" the discriminator into assigning the generated samples a high probability.
where we multiplied the second function by minus one to convert it to a minimization problem and dropped the second term, which has no dependence on θ. Minimizing the first loss function trains the discriminator. Minimizing the second trains the generator.
At each step, we draw a batch of latent variables z_j from the base distribution and pass these through the generator to create samples x*_j = g[z_j, θ]. Then we choose a batch of real training examples x_i. Given the two batches, we can now perform one or more gradient descent steps on each loss function (figure 15.2).
15.1.3 Deep convolutional GAN
The deep convolutional GAN or DCGAN was an early GAN architecture specialized
for generating images (figure 15.3). The input to the generator g[z, θ] is a 100D latent
Figure 15.3 DCGAN architecture. In the generator, a 100D latent variable z is
drawn from a uniform distribution and mapped by a linear transformation to a
4×4 representation with 1024 channels. This is then passed through a series of
convolutional layers that gradually upsample the representation and decrease the
number of channels. At the end is a tanh function that maps the 64×64×3 representation to a fixed range so that it can represent an image. The discriminator consists of a standard convolutional net that classifies the input as either a real
example or a generated sample.
variable z sampled from a uniform distribution. This is then mapped to a 4×4 spatial
representation with 1024 channels using a linear transformation. Four convolutional
layers follow, each of which uses a fractionally-strided convolution that doubles the resolution (i.e., a convolution with a stride of 0.5). At the final layer, the 64×64×3 signal is passed through a tanh function to generate an image x* in the range [−1, 1]. The discriminator f[•, ϕ] is a standard convolutional network where the final convolutional layer reduces the size to 1×1 with one channel. This single number is passed through a sigmoid function sig[•] to create the output probability.
After training, the discriminator is discarded. To create new samples, latent vari-
ables z are drawn from the base distribution and passed through the generator. Example
results are shown in figure 15.4.
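The generator's shape bookkeeping can be verified in a few lines. This is an illustrative sketch (the helper name is ours); the double-resolution/halve-channels schedule matches the DCGAN of Radford et al. (2015).

```python
def dcgan_generator_shapes(start=4, channels=1024, out_channels=3, layers=4):
    """Track (height, width, channels) through the DCGAN generator: a linear
    map produces 4x4x1024, then each fractionally-strided convolution doubles
    the resolution and halves the channels, ending in a 3-channel image."""
    shapes = [(start, start, channels)]
    h, c = start, channels
    for i in range(layers):
        h *= 2
        c = out_channels if i == layers - 1 else c // 2
        shapes.append((h, h, c))
    return shapes
```

This traces 4×4×1024 through 8×8×512, 16×16×256, and 32×32×128 to the final 64×64×3 image.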
15.1.4 Diculty training GANs
Theoretically, the GAN is fairly straightforward. However, GANs are notoriously dicult
to train. For example, to get the DCGAN to train reliably, it was necessary to (i) use
strided convolutions for upsampling and downsampling; (ii) use BatchNorm in both
generator and discriminator except in the last and rst layers, respectively; (iii) use the
leaky ReLU activation function (gure 3.13) in the discriminator; and (iv) use the Adam
optimizer but with a lower momentum coecient than usual. This is unusual. Most
deep learning models are relatively robust to such choices.
Figure 15.4 Synthesized images from the DCGAN model. a) Random samples
drawn from DCGAN trained on a faces dataset. b) Random samples using the
ImageNet database (see gure 10.15). c) Random samples drawn from the LSUN
scene understanding dataset. Adapted from Radford et al. (2015).
Figure 15.5 Mode collapse. Synthesized images from a GAN trained on the LSUN
scene understanding dataset using an MLP generator with a similar number of
parameters and layers to the DCGAN. The samples are low quality, and many
are similar. Adapted from Arjovsky et al. (2017).
A common failure mode is that the generator makes plausible samples, but these
only represent a subset of the data (e.g., for faces, it might never generate faces with
beards). This is known as mode dropping. An extreme version of this phenomenon can
occur where the generator entirely or mostly ignores the latent variables z and collapses
all samples to one or a few points; this is known as mode collapse (figure 15.5).
15.2 Improving stability
To understand why GANs are difficult to train, it's necessary to understand exactly what
the loss function represents.
15.2.1 Analysis of GAN loss function
If we divide the two sums in the first line of equation 15.5 by the numbers I, J of real and generated samples, then the loss function can be written in terms of expectations:
L[\phi] = -\frac{1}{J}\sum_{j=1}^{J} \log\Big[1-\mathrm{sig}\big[f[x^*_j,\phi]\big]\Big] - \frac{1}{I}\sum_{i=1}^{I} \log\Big[\mathrm{sig}\big[f[x_i,\phi]\big]\Big]   (15.6)

\approx -\mathbb{E}_{x^*}\Big[\log\big[1-\mathrm{sig}[f[x^*,\phi]]\big]\Big] - \mathbb{E}_{x}\Big[\log\big[\mathrm{sig}[f[x,\phi]]\big]\Big]

= -\int Pr(x^*)\log\Big[1-\mathrm{sig}\big[f[x^*,\phi]\big]\Big]dx^* - \int Pr(x)\log\Big[\mathrm{sig}\big[f[x,\phi]\big]\Big]dx,

where Pr(x*) is the probability distribution over the generated samples, and Pr(x) is the true probability distribution over the real examples.
When I = J, the optimal discriminator for an example x̃ of unknown origin is:

Pr(\mathrm{real}|\tilde{x}) = \mathrm{sig}\big[f[\tilde{x},\phi]\big] = \frac{Pr(\tilde{x}|\mathrm{real})}{Pr(\tilde{x}|\mathrm{generated}) + Pr(\tilde{x}|\mathrm{real})} = \frac{Pr(x)}{Pr(x^*) + Pr(x)},   (15.7)

where on the right-hand side, we evaluate x̃ against the generated distribution Pr(x*) and the real distribution Pr(x). Substituting into equation 15.6, we get:
L[\phi] = -\int Pr(x^*)\log\Big[1-\mathrm{sig}\big[f[x^*,\phi]\big]\Big]dx^* - \int Pr(x)\log\Big[\mathrm{sig}\big[f[x,\phi]\big]\Big]dx   (15.8)

= -\int Pr(x^*)\log\left[1 - \frac{Pr(x)}{Pr(x^*)+Pr(x)}\right]dx^* - \int Pr(x)\log\left[\frac{Pr(x)}{Pr(x^*)+Pr(x)}\right]dx

= -\int Pr(x^*)\log\left[\frac{Pr(x^*)}{Pr(x^*)+Pr(x)}\right]dx^* - \int Pr(x)\log\left[\frac{Pr(x)}{Pr(x^*)+Pr(x)}\right]dx.
Disregarding additive and multiplicative constants, this is the Jensen-Shannon divergence (appendix C.5.2; problems 15.1–15.2) between the synthesized distribution Pr(x*) and the true distribution Pr(x):
D_{JS}\Big[Pr(x^*) \,\big\|\, Pr(x)\Big]   (15.9)

= \frac{1}{2} D_{KL}\left[Pr(x^*) \,\middle\|\, \frac{Pr(x^*)+Pr(x)}{2}\right] + \frac{1}{2} D_{KL}\left[Pr(x) \,\middle\|\, \frac{Pr(x^*)+Pr(x)}{2}\right]

= \underbrace{\frac{1}{2}\int Pr(x^*)\log\left[\frac{2Pr(x^*)}{Pr(x^*)+Pr(x)}\right]dx^*}_{\text{quality}} + \underbrace{\frac{1}{2}\int Pr(x)\log\left[\frac{2Pr(x)}{Pr(x^*)+Pr(x)}\right]dx}_{\text{coverage}},
where D_{KL}[• || •] is the Kullback-Leibler divergence (appendix C.5.1).
The rst term indicates the distance will be small if, wherever the sample den-
sity P r(x
) is high, the mixture (P r(x
) + P r(x))/2 has high probability. In other
words, it penalizes regions with samples x
but no real examples x; it enforces quality.
The second term says that the distance will be small if, wherever the true density P r(x)
Figure 15.6 Problem with the GAN loss function. If the generated samples (orange arrows) are easy to distinguish from the real examples (cyan arrows), then the discriminator (sigmoid) may have a very shallow slope at the positions of the samples; hence, the gradient to update the parameter of the generator may be tiny.
is high, the mixture (Pr(x*) + Pr(x))/2 has high probability. In other words, it penalizes regions with real examples but no samples; it enforces coverage. Referring to equation 15.6, we see that the second term does not depend on the generator, which consequently doesn't care about coverage; it is happy to generate a subset of possible examples accurately. This is the putative reason for mode dropping.
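The quality/coverage split of equation 15.9 can be computed directly for discrete distributions. This is an illustrative sketch (the function name is ours; natural logarithms are used):

```python
import math

def js_terms(p_gen, p_real):
    """Quality and coverage terms of the Jensen-Shannon divergence for
    discrete distributions given as lists of bin probabilities (eq. 15.9)."""
    quality = coverage = 0.0
    for ps, pr in zip(p_gen, p_real):
        if ps > 0:  # quality: penalizes generated mass where real mass is absent
            quality += 0.5 * ps * math.log(2 * ps / (ps + pr))
        if pr > 0:  # coverage: penalizes real mass that the generator misses
            coverage += 0.5 * pr * math.log(2 * pr / (ps + pr))
    return quality, coverage

# A generator that drops half the modes: real mass on four bins, samples on two.
real = [0.25, 0.25, 0.25, 0.25]
dropped = [0.5, 0.5, 0.0, 0.0]
```

For a perfect generator both terms are zero; for the mode-dropping generator above, both terms are positive, and their sum is the full Jensen-Shannon divergence.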
15.2.2 Vanishing gradients
In the previous section, we saw that when the discriminator is optimal, the loss function
minimizes a measure of the distance between the generated and real samples. However,
there is a potential problem with using this distance between probability distributions
as the criterion for optimizing GANs. If the probability distributions are completely
disjoint, this distance is infinite, and any small change to the generator will not decrease
the loss. The same phenomenon can be seen when we consider the original formulation; if
the discriminator can perfectly separate the generated and real samples, no small change
to the generated data will change the classification score (figure 15.6).
Unfortunately, the distributions of generated samples and real examples may really be disjoint; the generated samples lie in a subspace whose dimensionality is that of the latent variable z, and the real examples also lie in a low-dimensional subspace due to the physical processes that created the data (figure 1.9). There may be little or no overlap between these subspaces, and the result is very small or no gradients.
Figure 15.7 provides empirical evidence to support this hypothesis. If the DCGAN generator is frozen and the discriminator is updated repeatedly so that its classification performance improves, the generator gradients decrease. In short, there is a very fine balance between the quality of the discriminator and the generator; if the discriminator becomes too good, the training updates of the generator are attenuated.
15.2.3 Wasserstein distance
The previous sections showed that (i) the GAN loss can be interpreted in terms of
distances between probability distributions and that (ii) the gradient of this distance
Figure 15.7 Vanishing gradients in the
generator of a DCGAN. The generator
is frozen after 1, 10, and 25 epochs,
and the discriminator is trained further.
The gradient of the generator decreases
rapidly (note log scale); if the discrimina-
tor becomes too accurate, the gradients
for the generator vanish. Adapted from
Arjovsky & Bottou (2017).
becomes zero when the generated samples are too easy to distinguish from the real
examples. The obvious way forward is to choose a distance metric with better properties.
The Wasserstein or (for discrete distributions) earth mover's distance is the quantity of work required to transport the probability mass from one distribution to create the other. Here, "work" is defined as the mass multiplied by the distance moved. This immediately sounds more promising; the Wasserstein distance is well-defined even when the distributions are disjoint and decreases smoothly as they become closer to one another.
15.2.4 Wasserstein distance for discrete distributions
The Wasserstein distance is easiest to understand for discrete distributions (figure 15.8). Consider distributions Pr(x = i) and q(x = j) defined over K bins. Assume there is a cost C_{ij} associated with moving one unit of mass from bin i in the first distribution to bin j in the second; this cost might be the absolute difference |i − j| between the indices. The amounts that are moved form the transport plan and are stored in a matrix P.
The Wasserstein distance is dened as:
D
w
h
P r(x)||q(x)
i
= min
P
X
i,j
P
ij
· |i j|
, (15.10)
subject to the constraints that:

\sum_j P_{ij} = Pr(x=i) \quad \text{(initial distribution of } Pr(x)\text{)}
\sum_i P_{ij} = q(x=j) \quad \text{(initial distribution of } q(x)\text{)}
P_{ij} \geq 0 \quad \text{(non-negative masses)}.   (15.11)
In other words, the Wasserstein distance is the solution to a constrained minimization problem that maps the mass of one distribution to the other. This is inconvenient, as we must solve this minimization problem over the elements P_{ij} every time we want to compute the distance. Fortunately, this is a standard problem that is easily solved for small systems of equations. It is a linear programming problem in its primal form.
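For the particular cost |i − j| in one dimension, the optimal transport plan need not be computed explicitly: the distance reduces to a sum of absolute differences between the two cumulative distributions. This shortcut is a standard fact about 1D optimal transport rather than something derived in the chapter; a sketch:

```python
def wasserstein_1d(p, q):
    """Earth mover's distance between two discrete distributions over bins
    0..K-1 with cost |i - j|: the sum of absolute differences between
    the two cumulative distributions."""
    total = cum_p = cum_q = 0.0
    for pi, qi in zip(p, q):
        cum_p += pi
        cum_q += qi
        total += abs(cum_p - cum_q)
    return total
```

For example, moving all mass from bin 0 to bin 2 costs 2, and the distance shrinks smoothly as the distributions approach one another, unlike the Jensen-Shannon divergence for disjoint distributions.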
Figure 15.8 Wasserstein or earth mover's distance. a) Consider the discrete distribution Pr(x = i). b) We wish to move the probability mass to create the target distribution q(x = j). c) The transport plan P identifies how much mass will be moved from i to j. For example, the cyan highlighted square P_{54} indicates how much mass will be moved from i = 5 to j = 4. The elements of the transport plan must be non-negative, the sum over j must be Pr(x = i), and the sum over i must be q(x = j). Hence, P is a joint probability distribution. d) The distance matrix between elements i and j. The optimal transport plan P minimizes the sum of the pointwise product of P and the distance matrix (termed the Wasserstein distance). Hence, the elements of P tend to lie close to the diagonal, where the distance cost is lowest. Adapted from Hermann (2017).
primal form: minimize c^T p, such that Ap = b and p ≥ 0
dual form: maximize b^T f, such that A^T f ≤ c

where p contains the vectorized elements P_{ij} that determine the amount of mass moved, c contains the distances, Ap = b contains the initial distribution constraints, and p ≥ 0 ensures the masses moved are non-negative (problem 15.3).[1]
As for all linear programming problems, there is an equivalent dual problem with the same solution. Here, we maximize with respect to a variable f that is applied to the initial distributions, subject to constraints that depend on the distances c (see notebook 15.2). The solution to this dual problem is:
D_{w}\Big[Pr(x) \,\big\|\, q(x)\Big] = \max_{f}\left[\sum_i Pr(x=i)f_i - \sum_j q(x=j)f_j\right],   (15.12)
[1] The mathematical background is omitted due to space constraints. Linear programming is a standard problem with well-known algorithms for finding the minimum.
subject to the constraint that:

|f_{i+1} - f_i| \leq 1.   (15.13)
In other words, we optimize over a new set of variables {f_i} where adjacent values cannot change by more than one.
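For tiny discrete problems, this dual can be checked by brute force: search over potentials f on an integer grid. This suffices here because the optimum of the linear program is attained at a vertex, where adjacent values of f differ by exactly 0 or ±1 (the function name and grid bound are our own):

```python
from itertools import product

def dual_wasserstein(p, q, fmax=3):
    """Brute-force the dual (eq. 15.12): maximize sum_i p_i f_i - sum_j q_j f_j
    over potentials f whose adjacent values differ by at most one, searched
    on a small integer grid (sufficient for tiny problems)."""
    best = 0.0  # f = 0 everywhere is feasible and gives value 0
    for f in product(range(-fmax, fmax + 1), repeat=len(p)):
        if all(abs(f[i + 1] - f[i]) <= 1 for i in range(len(f) - 1)):
            value = sum(pi * fi for pi, fi in zip(p, f)) \
                  - sum(qi * fi for qi, fi in zip(q, f))
            best = max(best, value)
    return best
```

For example, moving all mass from bin 0 to bin 2 gives distance 2, which the dual attains with the potential f = (2, 1, 0).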
15.2.5 Wasserstein distance for continuous distributions
Translating these results back to the continuous multi-dimensional domain, the equivalent of the primal form (equation 15.10) is (problems 15.4–15.5):

D_{w}\Big[Pr(x), q(x)\Big] = \min_{\pi(\cdot,\cdot)}\left[\iint \pi(x_1,x_2)\cdot\|x_1-x_2\|\,dx_1 dx_2\right],   (15.14)
subject to constraints similar to equation 15.11 on the transport plan π(x_1, x_2), which represents the mass moved from position x_1 to x_2. The equivalent of the dual form (equation 15.12) is:
(equation 15.12) is:
D
w
h
P r(x), q(x)
i
= max
f[x]
Z
P r(x)f[x]dx
Z
q(x)f[x]dx
, (15.15)
subject to the constraint that the Lipschitz constant of the function f[x] is less than one (i.e., the absolute gradient of the function is less than one; see appendix B.1.1).
15.2.6 Wasserstein GAN loss function
In the context of neural networks, we maximize over the space of functions f[x] by optimizing the parameters ϕ of a neural network f[x, ϕ], and we approximate the integrals using generated samples x*_j and real examples x_i:

L[\phi] = \sum_j f[x^*_j,\phi] - \sum_i f[x_i,\phi] = \sum_j f[g[z_j,\theta],\phi] - \sum_i f[x_i,\phi],   (15.16)

where we must constrain the neural network discriminator f[x, ϕ] to have an absolute gradient norm of less than one at every position x:

\left|\frac{\partial f[x,\phi]}{\partial x}\right| < 1.   (15.17)
One way to achieve this is to clip the discriminator weights to a small range (e.g., ±0.01).
An alternative is the gradient penalty Wasserstein GAN or WGAN-GP, which adds a
regularization term that increases as the gradient norm deviates from unity.
Figure 15.9 Progressive growing. a) The generator is initially trained to create
very small (4×4) images, and the discriminator to identify if these images are
synthesized or downsampled real images. b) After training at this low-resolution
terminates, subsequent layers are added to the generator to generate (8×8) im-
ages. Similar layers are added to the discriminator to downsample back again. c)
This process continues to create (16×16) images and so on. In this way, a GAN
that produces very realistic high-resolution images can be trained. d) Images of
increasing resolution generated at different stages from the same latent variable.
Adapted from Wolf (2021), using method of Karras et al. (2018).
15.3 Progressive growing, minibatch discrimination, and truncation
The Wasserstein formulation makes GAN training more stable. However, further ma-
chinery is needed to generate high-quality images. We now review progressive growing,
minibatch discrimination, and truncation, which all improve output quality.
In progressive growing (figure 15.9), we first train a GAN that synthesizes 4×4 images
using an architecture similar to the DCGAN. Then we add subsequent layers to the
generator, which upsample the representation and perform further processing to create
Figure 15.10 Truncation. The quality of GAN samples can be traded off against diversity by rejecting samples of the latent variable z that fall further than τ standard deviations from the mean. a) If this threshold is large (τ = 2.0), the samples are visually varied but may have defects. b–c) As this threshold is decreased, the average visual quality improves, but the diversity decreases. d) With a very small threshold, the samples look almost identical. By judiciously choosing this threshold, it's possible to increase the average quality of GAN results. Adapted from Brock et al. (2019).
Figure 15.11 Progressive growing. This method generates realistic images of faces
when trained on the CELEBA-HQ dataset and more complex, variable objects
when trained on LSUN categories. Adapted from Karras et al. (2018).
Figure 15.12 Traversing latent space of progressive GAN trained on LSUN cars.
Moving in the latent space produces car images that change smoothly. This
usually only works for short trajectories; eventually, the latent variable moves to
somewhere that produces unrealistic images. Adapted from Karras et al. (2018).
an 8×8 image. The discriminator also has extra layers added to it so that it can receive
the higher-resolution images and classify them as either being generated samples or real
examples. In practice, the higher-resolution layers gradually “fade in” over time; initially,
the higher-resolution image is an upsampled version of the previous result, passed via a
residual connection, and the new layers gradually take over.
Mini-batch discrimination ensures that the samples have sufficient variety and hence
helps prevent mode collapse. This can be done by computing feature statistics across
the mini-batches of synthesized and real data. These can be summarized and added as a
feature map (usually toward the end of the discriminator). This allows the discriminator
to send a signal back to the generator, encouraging it to include a similar amount of
variation in the synthesized data as in the original dataset.
Another trick to improve generation results is truncation (figure 15.10), in which only latent variables z with high probability (i.e., close to the mean) are chosen during sampling. This reduces the variation in the samples but improves their quality (problem 15.6). Careful normalization and regularization schemes also improve sample quality. Using combinations of these methods, GANs can synthesize varied and realistic images (figure 15.11). Moving smoothly through the latent space can also sometimes produce realistic interpolations from one synthesized image to another (figure 15.12).
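Truncation is simple to implement. The truncation trick of Brock et al. (2019) resamples out-of-range components of z individually; the sketch below is a simpler whole-vector rejection variant for illustration (the function name is ours):

```python
import random

def truncated_latents(n, dim, tau, seed=0):
    """Draw n standard-normal latent vectors, rejecting any draw with a
    component further than tau standard deviations from the mean."""
    rng = random.Random(seed)
    out = []
    while len(out) < n:
        z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        if all(abs(v) <= tau for v in z):
            out.append(z)
    return out
```

Smaller values of τ restrict the latent vectors to the high-probability core of the base distribution, trading diversity for average sample quality.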
15.4 Conditional generation
GANs produce realistic images but don't specify their attributes: we can't choose the hair color, ethnicity, or age of generated faces without training separate GANs for each combination of characteristics. Conditional generation models provide us with this control.
15.4.1 Conditional GAN
The conditional GAN passes a vector c of attributes to both the generator and discrimi-
nator, which are now written as g[z, c, θ] and f[x, c, ϕ], respectively. The generator aims
to transform the latent variable z into a data sample x with the correct attribute c. The
discriminator’s goal is to distinguish between (i) the generated sample with the target
attribute or (ii) a real example with the real attribute (figure 15.13a).
For the generator, the attribute c can be appended to the latent vector z. For the
discriminator, it may be appended to the input if the data are 1D. If the data comprise
images, the attribute can be linearly transformed to a 2D representation and appended
as an extra channel to the discriminator input or to one of its intermediate hidden layers.
15.4.2 Auxiliary classier GAN
The auxiliary classier GAN or ACGAN simplies conditional generation by requiring
that the discriminator correctly predicts the attribute (gure 15.13b). For a discrete
Figure 15.13 Conditional generation. a) The generator of the conditional GAN
also receives an attribute vector c describing some aspect of the image. As usual,
the discriminator receives either a real example or a generated sample, but now it
also receives the attribute vector; this encourages the samples both to be realistic
and compatible with the attribute. b) The generator of the auxiliary classifier
GAN (ACGAN) takes a discrete attribute variable. The discriminator must both
(i) determine if its input is real or synthetic and (ii) identify the class correctly.
c) The InfoGAN splits the latent variable into noise z and unspecied random
attributes c. The discriminator must distinguish if its input is real and also recon-
struct these attributes. In practice, this means that the variables c correspond to
salient aspects of the data with real-world interpretations (i.e., the latent space
is disentangled).
Figure 15.14 Auxiliary classier GAN. The generator takes a class label as well
as the latent vector. The discriminator must both identify if the data point is
real and predict the class label. This model was trained on ten ImageNet classes.
Left to right: generated examples of monarch butteries, goldnches, daisies,
redshanks, and gray whales. Adapted from Odena et al. (2017).
attribute with C categories, the discriminator takes the real/synthesized image as input
and has C + 1 outputs; the rst is passed through a sigmoid function and predicts if the
sample is generated or real. The remaining outputs are passed through a softmax func-
tion to predict the probability that the data belongs to each of the C classes. Networks
trained with this method can synthesize multiple classes from ImageNet (gure 15.14).
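The split of the C + 1 discriminator outputs can be sketched as follows (an illustrative NumPy fragment, not the authors' implementation; the function and variable names are hypothetical, and the convolutional body of the discriminator is omitted):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - a.max())  # subtract max for numerical stability
    return e / e.sum()

def acgan_discriminator_head(outputs):
    # First of the C+1 raw outputs -> probability the input is real;
    # remaining C outputs -> probability distribution over the classes.
    p_real = sigmoid(outputs[0])
    p_class = softmax(outputs[1:])
    return p_real, p_class

# Toy example with C = 3 classes
p_real, p_class = acgan_discriminator_head(np.array([0.0, 2.0, 1.0, 0.5]))
```

During training, `p_real` would feed the usual adversarial loss and `p_class` a cross-entropy loss against the known attribute.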
15.4.3 InfoGAN
The conditional GAN and ACGAN both generate samples that have predetermined
attributes. By contrast, InfoGAN (figure 15.13c) attempts to identify important attributes
automatically. The generator takes a vector consisting of random noise variables z and
random attribute variables c. The discriminator both predicts whether the image is real
or synthesized and estimates the attribute variables.
The insight is that interpretable real-world characteristics should be easiest to predict
and hence will be represented in the attribute variables c. The attributes in c may be
discrete (and a binary or multiclass cross-entropy loss would be used) or continuous (and
a least squares loss would be used). The discrete variables identify categories in the data,
and the continuous ones identify gradual modes of variation (figure 15.15).
15.5 Image translation
Although the adversarial discriminator was first used in the context of the GAN for
generating random samples, it can also be used as a prior that favors realism in tasks
that translate one data example into another. This is most commonly done with images,
Figure 15.15 InfoGAN for MNIST. a) Training examples from the MNIST
database, which consists of 28×28 pixel images of handwritten digits. b) The
first attribute c1 is categorical with 10 categories; each column shows samples
generated with one of these categories. The InfoGAN recovers the ten digits.
The attribute vectors c2 and c3 are continuous. c) Moving from left to right, each
column represents a different value of c2 while keeping the other latent variables
constant. This attribute seems to correspond to the orientation of the character.
d) The third attribute seems to correspond to the thickness of the stroke.
Adapted from Chen et al. (2016b).
where we might want to translate a grayscale image to color, a noisy image to a clean
one, a blurry image to a sharp one, or a sketch to a photo-realistic image.
This section discusses three image translation models that use different amounts of
manual labeling. The Pix2Pix model uses before/after pairs for training. Models with
adversarial losses use before/after pairs for the main model but also exploit unpaired
“after” images in the discriminator. The CycleGAN model uses unpaired images.
15.5.1 Pix2Pix
The Pix2Pix model (figure 15.16) is a network x = g[c, θ] that maps one image c to a
different style image x using a U-Net (figure 11.10) with parameters θ. A typical use
case would be colorization, where the input is grayscale, and the output is color. The
output should be similar to the input, and this is encouraged using a content loss that
penalizes the ℓ1 norm ||x − g[c, θ]||₁ between the input and output (see Appendix B.3.2).
However, the output image should also look like a realistic conversion of the input.
This is encouraged by using an adversarial discriminator f[c, x, ϕ], which ingests the
before and after images c and x. At each step, the discriminator tries to distinguish
between a real before/after pair and a before/synthesized pair. To the extent that these
can be distinguished successfully, a feedback signal is provided to modify the U-Net to
make its output more realistic. Since the content loss ensures that the large-scale image
structure is correct, the discriminator is mainly needed to ensure that the local texture
is plausible. To this end, the PatchGAN loss is based on a purely convolutional classifier.
At the last layer, each hidden unit indicates whether the region within its receptive field
is real or synthesized. These responses are averaged to provide the final output.
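The averaging step can be sketched as follows (an illustrative NumPy fragment, not the authors' code; in practice the patch logits would come from the final convolutional layer of the discriminator):

```python
import numpy as np

def patchgan_score(patch_logits):
    # Each entry of patch_logits is the real/synthesized logit produced by one
    # unit of the last convolutional layer (i.e., one receptive field / patch).
    probs = 1.0 / (1.0 + np.exp(-patch_logits))  # per-patch "real" probability
    return probs.mean()                          # average -> final output

uncertain = patchgan_score(np.zeros((4, 4)))      # all patches undecided -> 0.5
confident = patchgan_score(np.full((4, 4), 10.0))  # all patches judged "real"
```

Because every patch contributes independently, the discriminator judges local texture rather than global image structure, which the content loss already handles.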
One way to think of this model is that it is a conditional GAN where the U-Net is the
generator and is conditioned on an image rather than a label. Notice, though, that the
U-Net input does not include noise and so is not really a “generator” in the conventional
sense. Interestingly, the original authors experimented with adding noise z to the U-Net
in addition to the input image c. However, the network just learned to ignore it.
15.5.2 Adversarial loss
The discriminator of the Pix2Pix model attempted to distinguish whether before/after
pairs in an image translation task were plausible. This has the disadvantage that we
need ground truth before/after pairs to exploit the discriminator loss. Fortunately, there
is a simpler way to exploit the power of adversarial discriminators in the context of
supervised learning without the need for additional labeled training data.
An adversarial loss adds a penalty if a discriminator can distinguish the output of
a supervised network from a real example drawn from its output domain. Accordingly, the
supervised model changes its predictions to decrease this penalty. This may be done at
the scale of the entire output or at the level of patches, as in the Pix2Pix algorithm. This
helps improve the realism of complex structured outputs. However, it doesn’t necessarily
lead to a better solution in terms of the original loss function.
The super-resolution GAN or SRGAN uses this approach (figure 15.17). The main
model consists of a convolutional network with residual connections that ingests a low-
resolution image and converts this via upsampling layers to a high-resolution image. The
network is trained with three losses. The content loss measures the squared difference
between the output and the true high-resolution image. The VGG loss or perceptual
loss passes the synthesized and ground truth outputs through the VGG network and
measures the squared difference between their activations. This encourages the image to
be semantically similar to the target. Finally, the adversarial loss uses a discriminator
that attempts to distinguish whether this is a real high-resolution image or an upsampled
one. This encourages the output to be indistinguishable from real examples.
15.5.3 CycleGAN
The adversarial loss assumes that we have labeled before/after images for the main
supervised network. The CycleGAN addresses the situation where we have two sets of
data with distinct styles but no matching pairs. An example is converting a photo to
the artistic style of Monet. There exist many photos and many Monet paintings, but no
correspondence between them. CycleGAN exploits the idea that converting an image in
one direction (e.g., photo→Monet) and then back again should recover the original.
Figure 15.16 Pix2Pix model. a) The model translates an input image to a pre-
diction in a different style using a U-Net (see figure 11.10). In this case, it maps
a grayscale image to a plausibly colored version. The U-Net is trained with
two losses. First, the content loss encourages the output image to have a sim-
ilar structure to the input image. Second, the adversarial loss encourages the
grayscale/color image pair to be indistinguishable from a real pair in each local
region of these images. This framework can be adapted to many tasks, including
b) translating maps to satellite imagery, c) converting sketches of bags to photo-
realistic examples, d) colorization, and e) converting label maps to photorealistic
building facades. Adapted from Isola et al. (2017).
Figure 15.17 Super-resolution generative adversarial network (SRGAN). a) A
convolutional network with residual connections is trained to increase the reso-
lution of images by a factor of four. The model has losses that encourage the
content to be close to the true high-resolution image. However, it also includes
an adversarial loss, which penalizes results that can be distinguished from real
high-resolution images. b) Upsampled image using bicubic interpolation. c) Up-
sampled image using SRGAN. d) Upsampled image using bicubic interpolation.
e) Upsampled image using SRGAN. Adapted from Ledig et al. (2017).
The CycleGAN loss function is a weighted sum of three losses (figure 15.18). The
content loss encourages the before and after images to be similar and is based on the
ℓ1 norm. The adversarial loss uses a discriminator to encourage the output to be indistin-
guishable from real examples of the target domain. Finally, the cycle-consistency loss
encourages the mapping to be reversible. Here, two models are trained together. One
maps from the first domain to the second, and the other in the opposite direction. The
cycle-consistency loss will be low if the translated image can itself be translated success-
fully back to the image in the original domain. The model combines these three losses
to train networks to translate images from one style to another and back again.
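The cycle-consistency term can be sketched as follows (a toy NumPy version with stand-in translators `g` and `g_prime`; the real models are deep networks trained jointly with the adversarial and content losses):

```python
import numpy as np

def cycle_consistency_loss(c, g, g_prime):
    # Translate c to the other domain with g, map it back with g_prime, and
    # penalize the l1 difference from the original; do the same starting from
    # the second domain (here represented by c_other = g(c)).
    forward = np.abs(g_prime(g(c)) - c).mean()
    c_other = g(c)
    backward = np.abs(g(g_prime(c_other)) - c_other).mean()
    return forward + backward

# Toy, perfectly invertible "translators": the cycle loss is zero.
g = lambda x: x + 1.0
g_prime = lambda x: x - 1.0
loss = cycle_consistency_loss(np.arange(3.0), g, g_prime)
```

The loss is zero exactly when each mapping undoes the other, which is what discourages the networks from discarding the content of the input.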
15.6 StyleGAN
StyleGAN is a more contemporary GAN that partitions the variation in a dataset into
meaningful components, each of which is controlled by a subset of the latent variables.
In particular, StyleGAN controls the output image at different scales and separates
style from noise. For face images, large-scale changes include face shape and head pose,
medium-scale changes include the shape and details of facial features, and fine-scale
changes include hair and skin color. The style components represent aspects of the
image that are salient to human beings, and the noise aspects represent unimportant
variation such as the exact placement of hairs, stubble, freckles, or skin pores.
The GANs that we have seen until now started from a latent variable z which is
drawn from a standard base distribution. This was passed through a series of convolu-
tional layers to produce the output image. However, the latent variable inputs to the
generator can (i) be introduced at various points in the architecture and (ii) modify the
current representation at these points in different ways. StyleGAN makes these choices
judiciously to control scale and to separate style from noise (gure 15.19).
The main generative branch of StyleGAN starts with a learned constant 4×4 representation
with 512 channels. This passes through a series of convolutional layers that
gradually upsample the representation to generate the image at its final resolution. Two
sets of random latent variables representing style and noise are introduced at each scale;
the closer they are to the output, the finer the scale of details they represent.
The latent variables that represent noise are independently sampled Gaussian vectors
z1, z2, . . . and are injected additively after each convolution operation in the main
generative pipeline. They are the same spatial size as the main representation at the point
that they are added but are multiplied by learned per-channel scaling factors ψ1, ψ2, . . .
and so contribute in different amounts to each channel. As the resolution of the network
increases, this noise contributes at finer scales.
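The injection step can be sketched as follows (an illustrative NumPy fragment; the shapes follow the description above, but the function name is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def inject_noise(x, psi):
    # x: (C, H, W) representation in the main branch; psi: (C,) learned
    # per-channel scales. One H x W Gaussian noise map is shared by all
    # channels but scaled differently for each channel.
    z = rng.standard_normal(x.shape[1:])
    return x + psi[:, None, None] * z

x = np.zeros((512, 4, 4))
y = inject_noise(x, np.zeros(512))  # zero scales -> noise has no effect
```

Because the scales `psi` are learned, the network can choose how much stochastic variation each channel receives at each resolution.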
The latent variables that represent style begin as a 1×1×512 noise tensor, which is
passed through a seven-layer fully connected network to create an intermediate vari-
able w. This allows the network to decorrelate aspects of style so that each dimension
of w can represent an independent real-world factor such as head pose or hair color.
This variable w is linearly transformed to a 2×1×512 tensor y, which is used to set
the per-channel mean and variance of the representation across spatial positions in the
main branch after noise addition. This is termed adaptive instance normalization (fig-
ure 11.14e). A series of vectors y1, y2, . . . are injected in this way at several different
points in the main branch, so the same style contributes at different scales. Figure 15.20
shows examples of manipulating the style and noise vectors at different scales.
Figure 15.18 CycleGAN. Two models are trained simultaneously. The first,
c′ = g[c, θ], translates from an image c in the first style (horse) to an image c′ in the
second style (zebra). The second model, c = g′[c′, θ′], learns the opposite map-
ping. The cycle consistency loss penalizes both models if they cannot successfully
convert an image to the other domain and back to the original. In addition, two
adversarial losses encourage the translated images to look like realistic examples
of the target domain (shown here for zebra only). Two content losses encourage
the details and layout of the images before and after each mapping to be similar
(i.e., the zebra is in the same position and pose that the horse was and against
the same background and vice versa). Adapted from Zhu et al. (2017).
Figure 15.19 StyleGAN. The main pipeline (center row) starts with a constant
learned representation (gray box). This is passed through a series of convolutional
layers and gradually upsampled to create the output. Noise (top row) is added
at different scales by periodically adding Gaussian variables z1, z2, . . . with
per-channel scalings ψ1, ψ2, . . . . The Gaussian style variable z is passed through
a fully connected network to create intermediate variable w (bottom row). This
is used to set the mean and variance of each channel at various points in the pipeline.
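Adaptive instance normalization itself is simple to sketch (a NumPy illustration; `new_mean` and `new_std` stand in for the two halves of the style tensor y):

```python
import numpy as np

def adain(x, new_mean, new_std, eps=1e-8):
    # Normalize each channel of x over its spatial positions, then set the
    # per-channel mean and standard deviation supplied by the style variables.
    # x: (C, H, W); new_mean, new_std: (C,).
    mu = x.mean(axis=(1, 2), keepdims=True)
    sigma = x.std(axis=(1, 2), keepdims=True)
    x_norm = (x - mu) / (sigma + eps)
    return new_std[:, None, None] * x_norm + new_mean[:, None, None]

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16, 16))
out = adain(x, 2.0 * np.ones(8), 3.0 * np.ones(8))
```

After the operation, each channel of `out` has mean ≈ 2 and standard deviation ≈ 3, regardless of the statistics of the input, which is how the style variables impose their influence at every scale.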
15.7 Summary
GANs learn a generator network that transforms random noise into data that is indistin-
guishable from a training set. To this end, the generator is trained using a discriminator
network that tries to distinguish real examples from generated samples. The generator
is then updated so that the data that it creates is identied as being more “real” by the
discriminator. The original formulation of this idea has the flaw that the training signal
is weak when it’s easy to determine if the samples are real or generated. This led to the
Figure 15.20 StyleGAN results. First four columns show systematic changes in
style at various scales. Fifth column shows the effect of increasing noise magni-
tude. Last two columns show different noise vectors at two different scales.
Wasserstein GAN, which provides a more consistent training signal.
We reviewed convolutional GANs for generating images and a series of tricks that
improve the quality of the generated images, including progressive growing, mini-batch
discrimination, and truncation. Conditional GAN architectures introduce an auxiliary
vector that allows control over the output (e.g., the choice of object class). Image trans-
lation tasks retain this conditional information in the form of an image but dispense with
the random noise. The GAN discriminator now works as an additional loss term that
favors “realistic” looking images. Finally, we described StyleGAN, which injects noise
into the generator strategically to control the style and noise at different scales.
Notes
Goodfellow et al. (2014) introduced generative adversarial networks. An early review of progress
can be found in Goodfellow (2016). More recent overviews include Creswell et al. (2018) and
Gui et al. (2021). Park et al. (2021) present a review of GAN models that focuses on computer
vision applications. Hindupur (2022) maintains a list of named GAN models (numbering 501
at the time of writing) from ABC-GAN (Susmelj et al., 2017) right through to ZipNet-GAN
(Zhang et al., 2017b). Odena (2019) lists open problems concerning GANs.
Data: GANs have primarily been developed for image data. Examples include the deep con-
volutional GAN (Radford et al., 2015), progressive GAN (Karras et al., 2018), and StyleGAN
(Karras et al., 2019) models presented in this chapter. For this reason, most GANs are based on
convolutional layers, although more recently, GANs that exploit transformers in the generator
and discriminator to capture long-range correlations have been developed (e.g., SAGAN, Zhang
et al., 2019b). However, GANs have also been used to generate molecular graphs (De Cao &
Kipf, 2018), voice data (Saito et al., 2017; Donahue et al., 2018b; Kaneko & Kameoka, 2017;
Fang et al., 2018), EEG data (Hartmann et al., 2018), text (Lin et al., 2017a; Fedus et al.,
2018), music (Mogren, 2016; Guimaraes et al., 2017; Yu et al., 2017), 3D models (Wu et al.,
2016), DNA (Killoran et al., 2017), and video data (Vondrick et al., 2016; Wang et al., 2018a).
GAN loss functions: It was originally claimed that GANs converged to Nash equilibria
during training. However, more recent evidence suggests that this isn’t always the case (Farnia
& Ozdaglar, 2020; Jin et al., 2020; Berard et al., 2019). Arjovsky et al. (2017), Metz et al.
(2017), and Qi (2020) identified that the original GAN loss function was unstable, and this
led to different formulations. Mao et al. (2017) introduced the least squares GAN. For some
parameter choices, this implicitly minimizes the Pearson χ² divergence. Nowozin et al. (2016) argue that the
Jensen-Shannon divergence is a special case of a larger family of f-divergences and show that
any f-divergence can be used for training GANs. Jolicoeur-Martineau (2019) introduces the
relativistic GAN in which the discriminator estimates the probability that a real data example
is more realistic than a generated one rather than the absolute probability that it is real.
Zhao et al. (2017a) reformulate the GAN into a general energy-based framework in which the
discriminator is a function that attributes low energies to real data and higher energies elsewhere.
As an example, they use an autoencoder and base the energy on reconstruction error.
Arjovsky & Bottou (2017) analyzed vanishing gradients in GANs, and this led to the Wasserstein
GAN (Arjovsky et al., 2017), which is based on earth mover’s distance/optimal transport. The
Wasserstein formulation requires that the Lipschitz constant of the discriminator is less than
one; the original paper proposed to clip the weights in the discriminator, but subsequent work
imposed a gradient penalty (Gulrajani et al., 2016) or applied spectral normalization (Miyato
et al., 2018) to limit the Lipschitz constant. Other variations of the Wasserstein GAN were
introduced by Wu et al. (2018a), Bellemare et al. (2017b), and Adler & Lunz (2018). Hermann
(2017) presents an excellent blog post discussing duality and the Wasserstein GAN. For more
information about optimal transport, consult the book by Peyré et al. (2019). Lucic et al.
(2018) present an empirical comparison of GAN loss functions of the time.
Tricks for training GANs: Many heuristics improve the stability of training GANs and the
quality of the nal results. Marchesi (2017) rst used the truncation trick (gure 15.10) to trade
o the variability of GAN outputs relative to their quality. This was also proposed by Pieters
& Wiering (2018) and Brock et al. (2019), who added a regularizer that encourages the weight
matrices in the generator to be orthogonal. This means that truncating the latent variable has
a closer relationship to truncating the output variance and improves sample quality.
Other tricks include only using the gradients from the top K most realistic images (Sinha et al.,
2020), label smoothing in the discriminator (Salimans et al., 2016), updating the discriminator
using a history of generated images rather than the ones produced by the latest generator to
avoid model “oscillation” (Salimans et al., 2016), and adding noise to the discriminator input
(Arjovsky & Bottou, 2017). Kurach et al. (2019) present an overview of normalization and
regularization in GANs. Chintala et al. (2020) provide further suggestions for training GANs.
Sample diversity: The original GAN paper (Goodfellow et al., 2014) argued that given
enough capacity, training samples, and computation time, a GAN can learn to minimize the
Jensen-Shannon divergence between the generated samples and the true distribution. However,
subsequent work has cast doubt on whether this happens in practice. Arora et al. (2017) suggest
that the finite capacity of the discriminator means that the GAN training objective can
approach its optimum value even when the variation in the output distribution is limited. Wu
et al. (2017) approximated the log-likelihoods of the distributions produced by GANs using an-
nealed importance sampling and found a mismatch between the generated and real distributions.
Arora & Zhang (2017) ask human observers to identify GAN samples that are (near-)duplicates
and infer the diversity of images from the frequency of these duplicates. They found that for
DCGAN, a duplicate occurs with probability >50% with 400 samples; this implies that the
support size was 400,000, which is smaller than the training set. They also showed that the
diversity increased as a function of the discriminator size. Bau et al. (2019) take a different
approach and investigate the parts of the data space that GANs cannot generate.
Increasing diversity and preventing mode collapse: The extreme case of lack of diversity
is mode collapse, in which the network repeatedly produces the same image (Salimans et al.,
2016). This is a particular problem for conditional GANs, where the latent variable is sometimes
completely ignored, and the output depends only on the conditional information. Mao et al.
(2019) introduce a regularization term to help prevent mode collapse in conditional GANs, which
maximizes the ratio of the distance between generated images with respect to the corresponding
latent variables and hence encourages diversity in the outputs. Other work that aims to reduce
mode collapse includes VEEGAN (Srivastava et al., 2017), which introduces a reconstruction
network that maps the generated image back to the original noise and hence discourages many-
to-one mappings from noise to images.
Salimans et al. (2016) suggested computing statistics across the mini-batch and using the dis-
criminator to ensure that these are indistinguishable from the statistics of batches of real images.
This is known as mini-batch discrimination and is implemented by adding a layer toward the
end of the discriminator that learns a tensor for each image that captures the statistics of the
batch. This was simplied by Karras et al. (2018), who computed a standard deviation for each
feature in each spatial location over the mini-batch. Then they average over spatial locations
and features to get a single estimate. This is replicated to get a single feature map, which is
appended to a layer near the end of the discriminator network. Lin et al. (2018) pass concate-
nated (real or generated) samples to the discriminator and provide a theoretical analysis of how
presenting multiple samples to the discriminator increases diversity. MAD-GAN (Ghosh et al.,
2018) increases the diversity of GAN samples by using multiple generators and requiring the
single discriminator to identify which generator created the samples, thus providing a signal to
help push the generators to create different samples from one another.
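The simplified Karras et al. (2018) mini-batch statistic can be sketched as follows (illustrative NumPy, not the original implementation):

```python
import numpy as np

def minibatch_stddev_feature(x):
    # x: (N, C, H, W) activations near the end of the discriminator.
    std = x.std(axis=0)   # std over the batch for each feature/spatial location
    s = std.mean()        # average over features and locations -> one number
    # Replicate the scalar as a single extra feature map for every example.
    extra = np.full((x.shape[0], 1, x.shape[2], x.shape[3]), s)
    return np.concatenate([x, extra], axis=1)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3, 2, 2))
out = minibatch_stddev_feature(x)  # shape (4, 4, 2, 2)
```

If the generator collapses to near-identical samples, the appended statistic shrinks toward zero, giving the discriminator an easy cue that the batch is synthetic.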
Multiple scales: Wang et al. (2018b) used multiple discriminators at different scales to help
ensure that image quality is high in all frequency bands. Other work defined both generators
and discriminators at different resolutions (Denton et al., 2015; Zhang et al., 2017d; Huang
et al., 2017c). Karras et al. (2018) introduced the progressive growing method (figure 15.9),
which is somewhat simpler and faster to train.
StyleGAN: Karras et al. (2019) introduced the StyleGAN framework (section 15.6). In
subsequent work (Karras et al., 2020b), they improved the quality of generated images by (i)
redesigning the normalization layers in the generator to remove “water droplet” artifacts and
(ii) reducing artifacts where fine details do not follow the coarse details by changing the pro-
gressive growing framework. Further improvements include developing methods to train GANs
with limited data (Karras et al., 2020a) and fixing aliasing artifacts (Karras et al., 2021). A
large body of work finds and manipulates the latent variables in the StyleGAN to edit images
(e.g., Abdal et al., 2021; Collins et al., 2020; Härkönen et al., 2020; Patashnik et al., 2021; Shen
et al., 2020b; Tewari et al., 2020; Wu et al., 2021; Roich et al., 2022).
Conditional GANs: The conditional GAN was developed by Mirza & Osindero (2014), the
auxiliary classier GAN by Odena et al. (2017), and the InfoGAN by Chen et al. (2016b). The
discriminators of these models usually append the conditional information to the discriminator
input (Mirza & Osindero, 2014; Denton et al., 2015; Saito et al., 2017) or to an intermediate
hidden layer in the discriminator (Reed et al., 2016a; Zhang et al., 2017d; Perarnau et al.,
2016). However, Miyato & Koyama (2018) experimented with taking the inner product between
embedded conditional information and a layer of the discriminator, motivated by the role of
the class information in the underlying probabilistic model. Images generated by GANs have
variously been conditioned on classes (e.g., Odena et al., 2017), input text (Reed et al., 2016a;
Zhang et al., 2017d), attributes (Yan et al., 2016; Donahue et al., 2018a; Xiao et al., 2018b),
bounding boxes and keypoints (Reed et al., 2016b), and images (e.g., Isola et al., 2017).
Image translation: Isola et al. (2017) developed the Pix2Pix algorithm (figure 15.16), and a
similar system with higher-resolution results was subsequently developed by Wang et al. (2018b).
StarGAN (Choi et al., 2018) performs image-to-image translation across multiple domains using
only a single model. The idea of cycle consistency loss was introduced by Zhou et al. (2016b)
in DiscoGAN and Zhu et al. (2017) in CycleGAN (figure 15.18).
Adversarial loss: In many image translation tasks, there is no “generator”; such models can
be considered supervised learning tasks with an adversarial loss that encourages realism. The
super-resolution algorithm of Ledig et al. (2017) is a good example of this (figure 15.17). Esser
et al. (2021) used an autoencoder with an adversarial loss. This network takes an image,
reduces the representation size to create a “bottleneck,” and then reconstructs the image from
this reduced data space. In practice, the architecture is similar to encoder-decoder networks
(e.g., gure 10.19). After training, the autoencoder reproduces something that is both close
to the image and looks highly realistic. They vector quantize (discretize) the bottleneck of
the autoencoder and then learn a probability distribution over the discrete variables using a
transformer decoder. By sampling from this transformer decoder, they can produce extremely
large high-quality images.
Inverting GANs: One way to edit real images is to project them to the latent space, manip-
ulate the latent variable, and then re-project them to image space. This process is known as
resynthesis. Unfortunately, GANs only map from the latent variable to the observed data, not
vice versa. This has led to methods to invert GANs (i.e., nd the latent variable that corre-
sponds as closely as possible to an observed image). These methods fall into two classes. The
rst learns a network that maps in the opposite direction (Donahue et al., 2018b; Luo et al.,
2017a; Perarnau et al., 2016; Dumoulin et al., 2017; Guan et al., 2020). This is known as an
encoder. The second approach is to start with some latent variable z and optimize it until it
reconstructs the image as closely as possible (Creswell & Bharath, 2018; Karras et al., 2020b;
Abdal et al., 2019; Lipton & Tripathi, 2017). Zhu et al. (2020a) combine both approaches.
There has been particular interest in inversion for StyleGAN because it produces excellent results
and can control the image at dierent scales. Unfortunately, Abdal et al. (2020) showed that
it is not possible to invert StyleGAN without artifacts and proposed inverting to an extended
style space, and Richardson et al. (2021) trained an encoder that reliably maps to this space.
Even after inverting to the extended space, editing images that are out of domain may still not
work well. Roich et al. (2022) address this issue by fine-tuning the generator of StyleGAN so
that it reconstructs the image exactly and show that the result can be edited well. They also
add extra terms that reconstruct nearby points exactly so that the modification is local. This
technique is known as pivotal tuning. A survey of GAN inversion techniques can be found in
Xia et al. (2022).
Editing images with GANs: The iGAN (Zhu et al., 2016) allows users to make interactive
edits by scribbling or warping parts of an existing image. The tool then adjusts the output
image to be both realistic and to fit these new constraints. It does this by finding a latent
vector that produces an image that is similar to the edited image and obeys the edge map
of any added lines. It is typical also to add a mask so that only parts of the image close to
the edits are changed. EditGAN (Ling et al., 2021) jointly models images and their semantic
segmentation masks and allows edits to that mask.
Problems
Problem 15.1 What will the loss be in equation 15.8 when q(x) = Pr(x)?
Problem 15.2 Write an equation relating the loss L in equation 15.8 to the Jensen-Shannon
distance $D_{JS}[q(x)\,||\,Pr(x)]$ in equation 15.9.
Problem 15.3 Consider computing the earth mover’s distance using linear programming in the
primal form. The discrete distributions Pr(x=i) and q(x=j) are defined on x ∈ {1, 2, 3, 4} and:

$$\mathbf{b} = \bigl[Pr(x{=}1), Pr(x{=}2), Pr(x{=}3), Pr(x{=}4), q(x{=}1), q(x{=}2), q(x{=}3), q(x{=}4)\bigr]^T. \tag{15.18}$$

Write out the contents of the 8×16 matrix A. You may assume that the contents of P have
been vectorized into p column-first.
Problem 15.4 Calculate (i) the KL divergence, (ii) the reverse KL divergence, (iii) the Jensen-
Shannon divergence, and (iv) the Wasserstein distance between the distributions:

$$Pr(z) = \begin{cases} 0 & z < 0 \\ 1 & 0 \le z \le 1 \\ 0 & z > 1 \end{cases}, \quad\text{and}\quad q(z) = \begin{cases} 0 & z < a \\ 1 & a \le z \le a+1 \\ 0 & z > a+1 \end{cases}, \tag{15.19}$$

for the range a ∈ [−3, 3]. To get a formula for the Wasserstein distance for this special case,
consider the total “earth” (i.e., probability mass) that must be moved and multiply this by the
squared distance it must move.
Problem 15.5 The KL divergence and Wasserstein distance between univariate Gaussian distri-
butions are given by:

$$D_{KL} = \log\!\left(\frac{\sigma_2}{\sigma_1}\right) + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}, \tag{15.20}$$

and

$$D_{W} = (\mu_1 - \mu_2)^2 + \sigma_1^2 + \sigma_2^2 - 2\sigma_1\sigma_2, \tag{15.21}$$

respectively. Plot these distances as a function of $\mu_1 - \mu_2$ for the case when $\sigma_1 = \sigma_2 = 1$.
Problem 15.6 Consider a latent variable z with dimension 100. Consider truncating the values
of this variable to (i) τ = 2.0, (ii) τ = 1.0, (iii) τ = 0.5, (iv) τ = 0.04 standard deviations. What
proportion of the original probability distribution is disregarded in each case?
Chapter 16
Normalizing ows
Chapter 15 introduced generative adversarial networks (GANs). These are generative models that pass a latent variable through a deep network to create a new sample. GANs are trained using the principle that the samples should be indistinguishable from real data. However, they don't define a distribution over data examples. Hence, assessing the probability that a new example belongs to the same dataset isn't straightforward.

In this chapter, we describe normalizing flows. These learn a probability model by transforming a simple distribution into a more complicated one using a deep network. Normalizing flows can both sample from this distribution and evaluate the probability of new examples. However, they require a specialized architecture: each layer must be invertible. In other words, it must be able to transform data in both directions.
16.1 1D example

Normalizing flows are probabilistic generative models: they fit a probability distribution to training data (figure 14.2b). Consider modeling a 1D distribution Pr(x). Normalizing flows start with a simple tractable base distribution Pr(z) over a latent variable z and apply a function x = f[z, ϕ], where the parameters ϕ are chosen so that Pr(x) has the desired distribution (figure 16.1). Generating a new example x* is easy; we draw z* from the base density and pass this through the function so that x* = f[z*, ϕ].
16.1.1 Measuring probability
Measuring the probability of a data point x is more challenging. Consider applying a function f[z, ϕ] to random variable z with known density Pr(z). The probability density will decrease in areas that are stretched by the function and increase in areas that are compressed so that the area under the new distribution remains one. The degree to which a function f[z, ϕ] stretches or compresses its input depends on the magnitude of its gradient. If a small change to the input causes a larger change in the output, it
Draft: please send errata to udlbookmail@gmail.com.
Figure 16.1 Transforming probability distributions. a) The base density is a standard normal defined on a latent variable z. b) This variable is transformed by a function x = f[z, ϕ] to a new variable x, which c) has a new distribution. To sample from this model, we draw values z from the base density (green and brown arrows in panel (a) show two examples). We pass these through the function f[z, ϕ], as shown by the dotted arrows in panel (b), to generate the values of x, which are indicated as arrows in panel (c).
Figure 16.2 Transforming distributions. The base density (cyan, bottom) passes through a function (blue curve, top right) to create the model density (orange, left). Consider dividing the base density into equal intervals (gray vertical lines). The probability mass between adjacent lines must remain the same after transformation. The cyan-shaded region passes through a part of the function where the gradient is larger than one, so this region is stretched. Consequently, the height of the orange-shaded region must be lower so that it retains the same area as the cyan-shaded region. In other places (e.g., z = 2), the gradient is less than one, and the model density increases relative to the base density.
Figure 16.3 Inverse mapping (normalizing direction). If the function is invertible,
then it’s possible to transform the model density back to the original base density.
The probability of a point x under the model density depends partly on the
probability of the equivalent point z under the base density (see equation 16.1).
stretches the function. If a small change to the input causes a smaller change in the
output, it compresses the function (figure 16.2).
More precisely, the probability of data x under the transformed distribution is:

$$
Pr(x|\phi) = \left|\frac{\partial f[z,\phi]}{\partial z}\right|^{-1} \cdot Pr(z),
\tag{16.1}
$$

where z = f⁻¹[x, ϕ] is the latent variable that created x. The term Pr(z) is the original probability of this latent variable under the base density. This is moderated according to the magnitude of the derivative of the function. If this is greater than one, then the probability decreases. If it is smaller, the probability increases.

Notebook 16.1: 1D normalizing flows
16.1.2 Forward and inverse mappings

To draw samples from the distribution, we need the forward mapping x = f[z, ϕ], but to measure the likelihood, we need to compute the inverse z = f⁻¹[x, ϕ]. Hence, we need to choose f[z, ϕ] judiciously so that it is invertible.

Problems 16.1–16.2

The forward mapping is sometimes termed the generative direction. The base density is usually chosen to be a standard normal distribution. Hence, the inverse mapping is termed the normalizing direction since it takes the complex distribution over x and turns it into a normal distribution over z (figure 16.3).
16.1.3 Learning

To learn the distribution, we find parameters ϕ that maximize the likelihood of the training data $\{x_i\}_{i=1}^{I}$ or, equivalently, minimize the negative log-likelihood:

$$
\hat{\phi} = \underset{\phi}{\operatorname{argmax}}\left[\prod_{i=1}^{I} Pr(x_i|\phi)\right]
= \underset{\phi}{\operatorname{argmin}}\left[\sum_{i=1}^{I} -\log\Big[Pr(x_i|\phi)\Big]\right]
= \underset{\phi}{\operatorname{argmin}}\left[\sum_{i=1}^{I} \log\left[\left|\frac{\partial f[z_i,\phi]}{\partial z_i}\right|\right] - \log\Big[Pr(z_i)\Big]\right],
\tag{16.2}
$$

where we have assumed that the data are independent and identically distributed in the first line and used the likelihood definition from equation 16.1 in the third line.
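As a concrete sketch of equations 16.1–16.2, the snippet below builds a 1D flow in NumPy. The invertible function x = sinh(z) and all variable names are our own illustrative choices, not ones from the book:

```python
import numpy as np

# A minimal 1D normalizing flow sketch (illustrative choice of f, not from the book).
def base_log_prob(z):
    # Standard normal base density over the latent variable z.
    return -0.5 * z ** 2 - 0.5 * np.log(2.0 * np.pi)

def forward(z):            # generative direction: x = f[z]
    return np.sinh(z)

def inverse(x):            # normalizing direction: z = f^{-1}[x]
    return np.arcsinh(x)

def log_prob(x):
    # Equation 16.1 in log form: log Pr(x) = log Pr(z) - log |df/dz|.
    z = inverse(x)
    return base_log_prob(z) - np.log(np.cosh(z))

# Sampling: draw z from the base density and push it through f.
rng = np.random.default_rng(0)
samples = forward(rng.standard_normal(10000))

# Sanity check: the transformed density still integrates to ~1.
xs = np.linspace(-20.0, 20.0, 20001)
integral = np.sum(np.exp(log_prob(xs))) * (xs[1] - xs[0])
print(integral)  # close to 1
```

Because sinh stretches the tails, the resulting density is heavier-tailed than the base normal; the integral check confirms that the change-of-variables factor keeps the total probability mass at one.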
16.2 General case

The previous section developed a simple 1D example that modeled a probability distribution Pr(x) by transforming a simpler base density Pr(z). We now extend this to multivariate distributions Pr(x) and Pr(z) and add the complication that the transformation is defined by a deep neural network.

Consider applying a function x = f[z, ϕ] to a random variable z ∈ ℝ^D with base density Pr(z), where f[z, ϕ] is a deep network. The resulting variable x ∈ ℝ^D has a new distribution. A new sample x* can be drawn from this distribution by (i) drawing a sample z* from the base density and (ii) passing this through the neural network so that x* = f[z*, ϕ].

By analogy with equation 16.1, the likelihood of a sample under this distribution is:

$$
Pr(x|\phi) = \left|\frac{\partial f[z,\phi]}{\partial z}\right|^{-1} \cdot Pr(z),
\tag{16.3}
$$

where z = f⁻¹[x, ϕ] is the latent variable that created x. The first term is the inverse of the determinant of the D × D Jacobian matrix ∂f[z, ϕ]/∂z, which contains elements ∂f_i[z, ϕ]/∂z_j at position (i, j). Just as the absolute derivative measured the change of area at a point on a 1D function when the function was applied, the absolute determinant measures the change in volume at a point in the multivariate function. The second term is the probability of the latent variable under the base density.

Appendix B.3.8: Determinant
Appendix B.5: Jacobian
Figure 16.4 Forward and inverse mappings for a deep neural network. The base density (left) is gradually transformed by the network layers f₁[•, ϕ₁], f₂[•, ϕ₂], . . . to create the model density. Each layer is invertible, and we can equivalently think of the inverses of the layers as gradually transforming (or "flowing") the model density back to the base density.
16.2.1 Forward mapping with a deep neural network

In practice, the forward mapping f[z, ϕ] is usually defined by a neural network consisting of a series of layers f_k[•, ϕ_k] with parameters ϕ_k, which are composed together as:

$$
x = f[z, \phi] = f_K\Big[ f_{K-1}\big[ \ldots f_2\big[ f_1[z, \phi_1], \phi_2\big], \ldots \phi_{K-1}\big], \phi_K\Big].
\tag{16.4}
$$

The inverse mapping (normalizing direction) is defined by the composition of the inverses of each layer f_k⁻¹[•, ϕ_k], applied in the opposite order:

$$
z = f^{-1}[x, \phi] = f_1^{-1}\Big[ f_2^{-1}\big[ \ldots f_{K-1}^{-1}\big[ f_K^{-1}[x, \phi_K], \phi_{K-1}\big], \ldots \phi_2\big], \phi_1\Big].
\tag{16.5}
$$

The base density Pr(z) is usually defined as a multivariate standard normal (i.e., with mean zero and identity covariance). Hence, the effect of each subsequent inverse layer is to gradually move or "flow" the data density toward this normal distribution (figure 16.4). This gives rise to the name "normalizing flows."
The Jacobian of the forward mapping can be expressed as:

$$
\frac{\partial f[z,\phi]}{\partial z} =
\frac{\partial f_K[f_{K-1},\phi_K]}{\partial f_{K-1}} \cdot
\frac{\partial f_{K-1}[f_{K-2},\phi_{K-1}]}{\partial f_{K-2}} \cdots
\frac{\partial f_2[f_1,\phi_2]}{\partial f_1} \cdot
\frac{\partial f_1[z,\phi_1]}{\partial z},
\tag{16.6}
$$

where we have overloaded the notation to make f_k the output of the function f_k[•, ϕ_k]. The absolute determinant of this Jacobian can be computed by taking the product of the individual absolute determinants:
$$
\left|\frac{\partial f[z,\phi]}{\partial z}\right| =
\left|\frac{\partial f_K[f_{K-1},\phi_K]}{\partial f_{K-1}}\right| \cdot
\left|\frac{\partial f_{K-1}[f_{K-2},\phi_{K-1}]}{\partial f_{K-2}}\right| \cdots
\left|\frac{\partial f_2[f_1,\phi_2]}{\partial f_1}\right| \cdot
\left|\frac{\partial f_1[z,\phi_1]}{\partial z}\right|.
\tag{16.7}
$$

The absolute determinant of the Jacobian of the inverse mapping is found by applying the same rule to equation 16.5. It is the reciprocal of the absolute determinant in the forward mapping.

Problem 16.3
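Equation 16.7 and this reciprocal relationship can be verified numerically. The two invertible linear layers below are an illustrative stand-in for general network layers:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two invertible linear layers (a stand-in for general layers): f_k[h] = W_k h,
# whose Jacobians are simply W_1 and W_2.
W1 = rng.standard_normal((3, 3)) + 3.0 * np.eye(3)   # kept well-conditioned
W2 = rng.standard_normal((3, 3)) + 3.0 * np.eye(3)

# The Jacobian of the composition is the product of the per-layer Jacobians
# (equation 16.6), so its absolute determinant factorizes (equation 16.7).
det_composed = abs(np.linalg.det(W2 @ W1))
det_product = abs(np.linalg.det(W2)) * abs(np.linalg.det(W1))
print(det_composed, det_product)  # equal up to rounding

# The Jacobian determinant of the inverse mapping is the reciprocal.
det_inverse = abs(np.linalg.det(np.linalg.inv(W2 @ W1)))
print(det_inverse, 1.0 / det_composed)  # equal up to rounding
```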
We train normalizing ows with a dataset {x
i
} of I training examples using the
negative log-likelihood criterion:
ˆ
ϕ = argmax
ϕ
"
I
Y
i=1
P r(z
i
) ·
f[z
i
, ϕ]
z
i
1
#
= argmin
ϕ
"
I
X
i=1
log
"
f[z
i
, ϕ]
z
i
#
log
P r(z
i
)
#
, (16.8)
where z
i
= f
1
[x
i
, ϕ], P r(z
i
) is measured under the base distribution, and the absolute
determinant |f[z
i
, ϕ]/∂z
i
| is given by equation 16.7.
16.2.2 Desiderata for network layers

The theory of normalizing flows is straightforward. However, for this to be practical, we need neural network layers f_k that have four properties:

1. Collectively, the set of network layers must be sufficiently expressive to map a multivariate standard normal distribution to an arbitrary density.

2. The network layers must be invertible; each must define a unique one-to-one mapping from any input point to an output point (a bijection). If multiple inputs were mapped to the same output, the inverse would be ambiguous.

3. It must be possible to compute the inverse of each layer efficiently. We need to do this every time we evaluate the likelihood. This happens repeatedly during training, so there must be a closed-form solution or a fast algorithm for the inverse.

4. It must also be possible to evaluate the determinant of the Jacobian efficiently for either the forward or inverse mapping.

Appendix B.1: Bijection
16.3 Invertible network layers

We now describe different invertible network layers, or flows, for use in these models. We start with linear and elementwise flows. These are easy to invert, and it's possible to compute the determinants of their Jacobians, but neither is sufficiently expressive to describe arbitrary transformations of the base density. However, they form the building blocks of coupling, autoregressive, and residual flows, which are all more expressive.
16.3.1 Linear ows
A linear ow has the form f[h] = β + Ωh. If the matrix is invertible, the linear
transform is invertible. For R
D×D
, the computation of the inverse is O[D
3
]. The
Appendix A
Big O notation
determinant of the Jacobian is just the determinant of , which can also be computed
in O[D
3
]. This means that linear ows become expensive as the dimension D increases.
If the matrix takes a special form, then inversion and computation of the deter-
Appendix B.4
Matrix types
minant can become more ecient, but the transformation becomes less general. For
example, diagonal matrices require only O[D] computation for the inversion and deter-
minant, but the elements of h don’t interact. Orthogonal matrices are also more ecient
Problem 16.4
to invert, and their determinant is xed, but they do not allow scaling of the individual
dimensions. Triangular matrices are more practical; they are invertible using a process
known as back-substitution, which is O[D
2
], and the determinant is just the product of
the diagonal values.
One way to make a linear ow that is general, ecient to invert, and for which the
Jacobian can be computed eciently is to parameterize it directly in terms of the LU
decomposition. In other words, we use:
= PL(U + D), (16.9)
where P is a predetermined permutation matrix, L is a lower triangular matrix, U is
an upper triangular matrix with zeros on the diagonal, and D is a diagonal matrix that
supplies those missing diagonal elements. This can be inverted in O[D
2
], and the log
determinant is just the sum of the log of the absolute values on the diagonal of D.
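A minimal sketch of this LU parameterization follows (the matrices here are arbitrary illustrative choices, and np.linalg.solve stands in for a dedicated O[D²] triangular solver):

```python
import numpy as np

rng = np.random.default_rng(2)
D = 4

# Omega = P L (U + D) as in equation 16.9. L is given a unit diagonal so that
# the determinant is controlled entirely by the diagonal matrix D.
P = np.eye(D)[rng.permutation(D)]                           # fixed permutation
L = np.tril(rng.standard_normal((D, D)), k=-1) + np.eye(D)  # unit lower triangular
U = np.triu(rng.standard_normal((D, D)), k=1)               # zero diagonal
d = np.array([1.5, -0.7, 2.2, 0.9])                         # diagonal entries
Omega = P @ L @ (U + np.diag(d))

# Log absolute determinant from the diagonal alone (|det P| = |det L| = 1).
log_det = np.sum(np.log(np.abs(d)))
print(np.isclose(log_det, np.log(np.abs(np.linalg.det(Omega)))))  # True

# Inversion needs only a permutation and two triangular solves.
h = rng.standard_normal(D)
x = Omega @ h
h_rec = np.linalg.solve(U + np.diag(d), np.linalg.solve(L, P.T @ x))
print(np.allclose(h, h_rec))  # True
```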
Unfortunately, linear ows are not suciently expressive. When a linear func-
tion f[h] = β + Ωh is applied to normally distributed input Norm
h
[µ, Σ], then the
result is also normally distributed with mean and variance, β + µ and ΩΣΩ
T
, respec-
Problems 16.5–16.6
tively. Hence, it is not possible to map a normal distribution to an arbitrary density
using linear ows alone.
16.3.2 Elementwise ows
Since linear ows are not suciently expressive, we must turn to nonlinear ows. The
simplest of these are elementwise ows, which apply a pointwise nonlinear function f
[
,
ϕ
]
with parameters ϕ to each element of the input so that:
f[h] =
h
f[h
1
, ϕ], f[h
2
, ϕ], . . . f[h
D
, ϕ]
i
T
. (16.10)
The Jacobian f[h]/∂h is diagonal since the d
th
input to f[h] only aects the d
th
output.
Its determinant is the product of the entries on the diagonal, so:
f[h]
h
=
D
Y
d=1
f[h
d
]
h
d
. (16.11)
The function f[, ϕ] could be a xed invertible nonlinearity like the leaky ReLU
(gure 3.13), in which case there are no parameters, or it may be any parameterized
Problem 16.7
Figure 16.5 Piecewise linear mapping. An invertible piecewise linear mapping h′ = f[h, ϕ] can be created by dividing the input domain h ∈ [0, 1] into K equally sized regions (here K = 5). Each region has a slope with parameter ϕ_k. a) If these parameters are positive and sum to one, then b) the function will be invertible and map to the output domain h′ ∈ [0, 1].
invertible one-to-one mapping. A simple example is a piecewise linear function with K regions (figure 16.5), which maps [0, 1] to [0, 1] as:

$$
f[h, \phi] = \left(\sum_{k=1}^{b-1} \phi_k\right) + (hK - b + 1)\,\phi_b,
\tag{16.12}
$$

where the parameters ϕ₁, ϕ₂, . . . , ϕ_K are positive and sum to one, and b = ⌈Kh⌉ is the index of the bin that contains h. The first term is the sum of all the preceding bins, and the second term represents the proportion of the way through the current bin that h lies. This function is easy to invert, and its gradient can be calculated almost everywhere.

Problems 16.8–16.9

There are many similar schemes for creating smooth functions, often using splines with parameters that ensure the function is monotonic and hence invertible.
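The piecewise linear map of equation 16.12 and its inverse can be sketched as follows; the five slope parameters are an arbitrary illustrative choice:

```python
import numpy as np

# K positive parameters that sum to one (equation 16.12).
phi = np.array([0.1, 0.3, 0.2, 0.35, 0.05])
K = len(phi)
cum = np.concatenate([[0.0], np.cumsum(phi)])   # cumulative sums of the bins

def f(h):
    # b is the index of the bin containing h (h = 0 falls in the first bin).
    b = np.clip(np.ceil(K * h).astype(int), 1, K)
    return cum[b - 1] + (h * K - b + 1) * phi[b - 1]

def f_inverse(hp):
    # Locate the output bin, then undo the linear map within it.
    b = np.clip(np.searchsorted(cum, hp, side="left"), 1, K)
    return ((hp - cum[b - 1]) / phi[b - 1] + b - 1) / K

h = np.linspace(0.0, 1.0, 11)
print(np.allclose(f_inverse(f(h)), h))   # True: the map is invertible
print(f(np.array([1.0])))                # ~1, since the slopes sum to one
```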
Elementwise ows are nonlinear but don’t mix input dimensions, so they can’t create
correlations between variables. When alternated with linear ows (which do mix dimen-
sions), more complex transformations can be modeled. However, in practice, elementwise
ows are used as components of more complex layers like coupling ows.
16.3.3 Coupling ows
Coupling ows divide the input h into two parts so that h = [h
T
1
, h
T
2
]
T
and dene the
ow f[h, ϕ] as:
Figure 16.6 Coupling ows. a) The input (orange vector) is divided into h
1
and h
2
. The rst part h
1
of the output (cyan vector) is a copy of h
1
. The
output h
2
is created by applying an invertible transformation g[, ϕ] to h
2
, where
the parameters ϕ are themselves a (not necessarily invertible) function of h
1
. b)
In the inverse mapping, h
1
= h
1
. This allows us to calculate the parameters ϕ[h
1
]
and then apply the inverse g
1
[h
2
, ϕ] to retrieve h
2
.
$$
\begin{aligned}
h_1' &= h_1 \\
h_2' &= g\big[h_2, \phi[h_1]\big].
\end{aligned}
\tag{16.13}
$$

Here g[•, ϕ] is an elementwise flow (or other invertible layer) with parameters ϕ[h₁] that are themselves a nonlinear function of the input h₁ (figure 16.6). The function ϕ[•] is usually a neural network of some kind and does not have to be invertible. The original variables can be recovered as:

$$
\begin{aligned}
h_1 &= h_1' \\
h_2 &= g^{-1}\big[h_2', \phi[h_1]\big].
\end{aligned}
\tag{16.14}
$$

If the function g[•, ϕ] is an elementwise flow, the Jacobian will be triangular, with the identity matrix in the top-left quadrant and the derivatives of the elementwise transformation on the diagonal of the bottom-right quadrant. Its determinant is the product of these diagonal values.

The inverse and the Jacobian can be computed efficiently, but this approach only transforms the second half of the variables, and in a way that depends on the first half. To make a more general transformation, the elements of h are randomly shuffled using permutation matrices between layers, so every variable is ultimately transformed by every other. In practice, these permutation matrices are difficult to learn. Hence, they are initialized randomly and then frozen. For structured data like images, the channels are divided into two halves h₁ and h₂ and permuted between layers using 1×1 convolutions.

Appendix B.4.4: Permutation matrix
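A coupling layer in the style of equations 16.13–16.14 can be sketched as below. The affine transformation playing the role of g and the tiny conditioner network ϕ[h₁] are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(3)

# Tiny conditioner phi[h1] -> (log scales, shifts); it need not be invertible.
W, b = rng.standard_normal((4, 2)), rng.standard_normal(4)

def conditioner(h1):
    out = np.tanh(W @ h1 + b)
    return out[:2], out[2:]                 # log_s and t for the second half

def coupling_forward(h):                    # equation 16.13
    h1, h2 = h[:2], h[2:]
    log_s, t = conditioner(h1)
    return np.concatenate([h1, h2 * np.exp(log_s) + t])

def coupling_inverse(hp):                   # equation 16.14
    h1, h2p = hp[:2], hp[2:]
    log_s, t = conditioner(h1)              # h1 was copied, so phi[h1] is known
    return np.concatenate([h1, (h2p - t) * np.exp(-log_s)])

def log_abs_det_jacobian(h):
    # Triangular Jacobian with diagonal [1, 1, exp(log_s)].
    log_s, _ = conditioner(h[:2])
    return np.sum(log_s)

h = rng.standard_normal(4)
print(np.allclose(coupling_inverse(coupling_forward(h)), h))  # True
```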
Figure 16.7 Autoregressive ows. The input h (orange column) and output h
(cyan column) are split into their constituent dimensions (here four dimensions).
a) Output h
1
is an invertible transformation of input h
1
. Output h
2
is an in-
vertible function of input h
2
where the parameters depend on h
1
. Output h
3
is an invertible function of input h
3
where the parameters depend on previous
inputs h
1
and h
2
, and so on. None of the outputs depend on one another, so
they can be computed in parallel. b) The inverse of the autoregressive ow is
computed using a similar method as for coupling ows. However, notice that to
compute
h
2
we must already know h
1
, to compute h
3
, we must already know h
1
and h
2
, and so on. Consequently, the inverse cannot be computed in parallel.
16.3.4 Autoregressive ows
Autoregressive ows are a generalization of coupling ows that treat each input dimension
as a separate “block” (gure 16.7). They compute the d
th
dimension of the output h
based on the rst d1 dimensions of the input h:
h
d
= g
h
h
d
, ϕ[h
1:d1
]
i
. (16.15)
The function g[, ] is termed the transformer,
1
and the parameters ϕ, ϕ[h
1
], ϕ[h
1
, h
2
], . . .
are termed conditioners. As for coupling ows, the transformer g[, ϕ] must be invert-
ible, but the conditioners ϕ[] can take any form and are usually neural networks. If the
transformer and conditioner are suciently exible, autoregressive ows are universal
approximators in that they can represent any probability distribution.
It’s possible to compute all of the entries of the output h
in parallel using a network
with appropriate masks so that the parameters ϕ at position d only depend on previous
¹This is nothing to do with the transformer layers discussed in chapter 12.
positions. This is known as a masked autoregressive flow. The principle is very similar to masked self-attention (section 12.7.2); connections that relate inputs to previous outputs are pruned.
Inverting the transformation is less efficient. Consider the forward mapping:

$$
\begin{aligned}
h_1' &= g\big[h_1, \phi\big] \\
h_2' &= g\big[h_2, \phi[h_1]\big] \\
h_3' &= g\big[h_3, \phi[h_{1:2}]\big] \\
h_4' &= g\big[h_4, \phi[h_{1:3}]\big].
\end{aligned}
\tag{16.16}
$$

This must be inverted sequentially using a similar principle as for coupling flows:

$$
\begin{aligned}
h_1 &= g^{-1}\big[h_1', \phi\big] \\
h_2 &= g^{-1}\big[h_2', \phi[h_1]\big] \\
h_3 &= g^{-1}\big[h_3', \phi[h_{1:2}]\big] \\
h_4 &= g^{-1}\big[h_4', \phi[h_{1:3}]\big].
\end{aligned}
\tag{16.17}
$$

This can't be done in parallel, as the computation for h_d depends on h_{1:d−1} (i.e., the partial results so far). Hence, inversion is time-consuming when the input is large.

Notebook 16.2: Autoregressive flows
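The contrast between the parallel forward mapping (equation 16.16) and the sequential inversion (equation 16.17) can be sketched with an affine transformer; the strictly lower triangular linear conditioners below are an illustrative stand-in for a masked network:

```python
import numpy as np

rng = np.random.default_rng(4)
D = 5

# Affine transformer g[h_d, phi] = s_d * h_d + t_d. The strictly lower
# triangular weights make s_d and t_d depend only on h_{1:d-1}.
Ws = np.tril(rng.standard_normal((D, D)), k=-1)
Wt = np.tril(rng.standard_normal((D, D)), k=-1)

def forward(h):
    s = np.exp(0.1 * (Ws @ h))   # positive scales from the conditioner
    t = Wt @ h                   # shifts from the conditioner
    return s * h + t             # all outputs computed at once (eq. 16.16)

def inverse(hp):
    h = np.zeros(D)
    for d in range(D):           # sequential: h_d needs h_{1:d-1} (eq. 16.17)
        s_d = np.exp(0.1 * (Ws[d] @ h))
        t_d = Wt[d] @ h
        h[d] = (hp[d] - t_d) / s_d
    return h

h = rng.standard_normal(D)
print(np.allclose(inverse(forward(h)), h))  # True
```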
16.3.5 Inverse autoregressive ows
Masked autoregressive ows are dened in the normalizing (inverse) direction. This is
required to evaluate the likelihood eciently and hence to learn the model. However,
sampling requires the forward direction, in which each variable must be computed se-
quentially at each layer, which is slow. If we use an autoregressive ow for the forward
(generative) transformation, then sampling is ecient, but computing the likelihood (and
training) is slow. This is known as an inverse autoregressive ow.
A trick that allows fast learning and also fast (but approximate) sampling is to
build a masked autoregressive ow to learn the distribution (the teacher) and then use
this to train an inverse autoregressive ow from which we can sample eciently (the
student). This requires a dierent formulation of normalizing ows that learns from
another function rather than a set of samples (see section 16.5.3).
16.3.6 Residual ows: iRevNet
Residual ows take their inspiration from residual networks. They divide the input into
two parts h = [h
T
1
, h
T
2
]
T
(as for coupling ows) and dene the outputs as:
Figure 16.8 Residual ows. a) An invertible function is computed by splitting the
input into h
1
and h
2
and creating two residual layers. In the rst, h
2
is processed
and h
1
is added. In the second, the result is processed, and h
2
is added. b) In
the reverse mechanism the functions are computed in the opposite order, and the
addition operation becomes subtraction.
$$
\begin{aligned}
h_1' &= h_1 + f_1[h_2, \phi_1] \\
h_2' &= h_2 + f_2[h_1', \phi_2],
\end{aligned}
\tag{16.18}
$$

where f₁[•, ϕ₁] and f₂[•, ϕ₂] are two functions that do not necessarily have to be invertible (figure 16.8). The inverse can be computed by reversing the order of computation:

$$
\begin{aligned}
h_2 &= h_2' - f_2[h_1', \phi_2] \\
h_1 &= h_1' - f_1[h_2, \phi_1].
\end{aligned}
\tag{16.19}
$$

As for coupling flows, the division into blocks restricts the family of transformations that can be represented. Hence, the inputs are permuted between layers so that the variables can mix in arbitrary ways.

This formulation can be inverted easily, but for general functions f₁[•, ϕ₁] and f₂[•, ϕ₂], there is no efficient way to compute the Jacobian. This formulation is sometimes used to save memory when training residual networks; because the network is invertible, storing the activations at each layer in the forward pass is unnecessary.

Problem 16.10
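Equations 16.18–16.19 can be sketched directly; f₁ and f₂ below are arbitrary functions that are not themselves invertible:

```python
import numpy as np

rng = np.random.default_rng(5)

# Two arbitrary (non-invertible) residual functions.
W1, W2 = rng.standard_normal((3, 3)), rng.standard_normal((3, 3))
f1 = lambda h: np.tanh(W1 @ h)
f2 = lambda h: np.tanh(W2 @ h)

def forward(h1, h2):               # equation 16.18
    h1p = h1 + f1(h2)
    h2p = h2 + f2(h1p)
    return h1p, h2p

def inverse(h1p, h2p):             # equation 16.19: reverse order, subtract
    h2 = h2p - f2(h1p)
    h1 = h1p - f1(h2)
    return h1, h2

h1, h2 = rng.standard_normal(3), rng.standard_normal(3)
h1r, h2r = inverse(*forward(h1, h2))
print(np.allclose(h1r, h1) and np.allclose(h2r, h2))  # True
```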
16.3.7 Residual ows and contraction mappings: iResNet
A dierent approach to exploiting residual networks is to utilize the Banach xed point
theorem or contraction mapping theorem, which states that every contraction mapping
has a xed point. A contraction mapping f[] has the property that:
Figure 16.9 Contraction mappings. If a function has an absolute slope of less than one everywhere, iterating the function converges to a fixed point f[z] = z. a) Starting at z₀, we evaluate z₁ = f[z₀]. We then pass z₁ back into the function and iterate. Eventually, the process converges to the point where f[z] = z (i.e., where the function crosses the dashed diagonal identity line). b) This can be used to invert equations of the form y = z + f[z] for a value y* by noticing that the fixed point of y* − f[z] (where the orange line crosses the dashed identity line) is at the same position as where y* = z + f[z].
$$
\operatorname{dist}\big[\,f[z'],\, f[z]\,\big] < \beta \cdot \operatorname{dist}\big[z', z\big] \qquad \forall\; z, z',
\tag{16.20}
$$

where dist[•, •] is a distance function and 0 < β < 1. When a function with this property is iterated (i.e., the output is repeatedly passed back in as an input), the result converges to a fixed point where f[z] = z (figure 16.9). To understand this, consider applying the function to both the fixed point and the current position; the fixed point remains static, but the distance between the two must become smaller, so the current position must get closer to the fixed point.

Notebook 16.3: Contraction mappings
This theorem can be exploited to invert an equation of the form:

$$
y = z + f[z]
\tag{16.21}
$$

if f[z] is a contraction mapping. In other words, it can be used to find the z that maps to a given value y*. This can be done by starting with any point z₀ and iterating z_{k+1} = y* − f[z_k]. This has a fixed point at z + f[z] = y* (figure 16.9b).
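This fixed-point iteration can be sketched in 1D with an illustrative contraction f[z] = 0.6 sin[z] (its absolute slope is at most 0.6):

```python
import numpy as np

f = lambda z: 0.6 * np.sin(z)    # a contraction mapping: |f'(z)| <= 0.6 < 1

# Invert y = z + f(z) for a target y_star by iterating z <- y_star - f(z).
y_star = 1.3
z = 0.0                          # any starting point works
for _ in range(100):
    z = y_star - f(z)

print(z + f(z))  # ~1.3: the recovered z maps back to y_star
```

The error shrinks by at least a factor of 0.6 per iteration, so a few dozen iterations reach machine precision.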
The same principle can be used to invert residual network layers of the form h′ = h + f[h, ϕ] if we ensure that f[h, ϕ] is a contraction mapping. In practice, this means that the Lipschitz constant must be less than one. Assuming that the slope of the activation functions is not greater than one, this is equivalent to ensuring that the largest eigenvalue of each weight matrix is less than one. A crude way to do this is to ensure that the absolute magnitudes of the weights are small by clipping them.

Appendix B.1.1: Lipschitz constant
Appendix B.3.7: Eigenvalues
The Jacobian determinant cannot be computed easily, but its logarithm can be approximated using a series of tricks:

$$
\log\left[\,\left|\mathrm{I} + \frac{\partial f[h,\phi]}{\partial h}\right|\,\right]
= \operatorname{trace}\left[\log\left[\mathrm{I} + \frac{\partial f[h,\phi]}{\partial h}\right]\right]
= \sum_{k=1}^{\infty} \frac{(-1)^{k-1}}{k} \operatorname{trace}\left[\left(\frac{\partial f[h,\phi]}{\partial h}\right)^{k}\right],
\tag{16.22}
$$

where we have used the identity log[|A|] = trace[log[A]] in the first line and expanded this into a power series in the second line.
Even when we truncate this series, it's still computationally expensive to compute the trace of the constituent terms. Hence, we approximate this using Hutchinson's trace estimator. Consider a normal random variable ϵ with mean 0 and covariance I. The trace of a matrix A can be estimated as:

$$
\operatorname{trace}[\mathrm{A}]
= \operatorname{trace}\big[\mathrm{A}\,\mathbb{E}[\epsilon\epsilon^T]\big]
= \operatorname{trace}\big[\mathbb{E}[\mathrm{A}\epsilon\epsilon^T]\big]
= \mathbb{E}\big[\operatorname{trace}[\mathrm{A}\epsilon\epsilon^T]\big]
= \mathbb{E}\big[\operatorname{trace}[\epsilon^T \mathrm{A}\epsilon]\big]
= \mathbb{E}\big[\epsilon^T \mathrm{A}\epsilon\big],
\tag{16.23}
$$

where the first line is true because E[ϵϵᵀ] = I. The second line derives from the properties of the expectation operator. The third line comes from the linearity of the trace operator. The fourth line is due to the invariance of the trace to cyclic permutation. The final line is true because the argument in the fourth line is now a scalar. We estimate the trace by drawing samples ϵ_i from Pr(ϵ):

$$
\operatorname{trace}[\mathrm{A}] = \mathbb{E}\big[\epsilon^T \mathrm{A}\epsilon\big] \approx \frac{1}{I}\sum_{i=1}^{I} \epsilon_i^T \mathrm{A}\epsilon_i.
\tag{16.24}
$$

Appendix B.3.8: Trace
In this way, we can approximate the trace of the powers in the Taylor expansion (equation 16.22) and evaluate the log probability.
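Hutchinson's estimator (equation 16.24) can be checked numerically on an arbitrary matrix:

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((10, 10))   # an arbitrary matrix standing in for the
                                    # powers of the Jacobian in equation 16.22

# trace[A] ~ (1/I) sum_i eps_i^T A eps_i for eps_i ~ Norm[0, I] (equation 16.24).
n_samples = 100000
eps = rng.standard_normal((n_samples, 10))
estimate = np.mean(np.einsum('ij,jk,ik->i', eps, A, eps))

print(estimate, np.trace(A))  # the estimate is close to the exact trace
```

Each term ϵᵢᵀAϵᵢ is an unbiased but noisy estimate; averaging many of them shrinks the standard error at the usual 1/√I rate.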
16.4 Multi-scale ows
In normalizing ows, the latent space z must be the same size as the data space x, but
we know that natural datasets can often be described by fewer underlying variables. At
Figure 16.10 Multiscale ows. The latent space z must be the same size as the
model density in normalizing ows. However, it can be partitioned into several
components, which can be gradually introduced at dierent layers. This makes
both density estimation and sampling faster. For the inverse process, the black
arrows are reversed, and the last part of each block skips the remaining processing.
For example, f
1
3
[, ϕ
3
] only operates on the rst three blocks, and the fourth
block becomes z
4
and is assessed against the base density.
some point, we have to introduce all of these variables, but it is inefficient to pass them through the entire network. This leads to the idea of multi-scale flows (figure 16.10).

In the generative direction, multi-scale flows partition the latent vector into z = [z₁, z₂, . . . , z_N]. The first partition z₁ is processed by a series of reversible layers with the same dimension as z₁ until, at some point, z₂ is appended and combined with the first partition. This continues until the network is the same size as the data x. In the normalizing direction, the network starts at the full dimension of x, but when it reaches the point where z_n was added, this part is assessed against the base distribution.
16.5 Applications

We now describe three applications of normalizing flows. First, we consider modeling probability densities. Second, we consider the GLOW model for synthesizing images. Finally, we discuss using normalizing flows to approximate other distributions.

16.5.1 Modeling densities

Of the four generative models discussed in this book, normalizing flows are the only one that can compute the exact log-likelihood of a new sample. Generative adversarial
Figure 16.11 Modeling densities. a) Toy 2D data samples. b) Modeled density using iResNet. c–d) A second example. Adapted from Behrmann et al. (2019).
networks are not probabilistic, and both variational autoencoders and diffusion models can only return a lower bound on the likelihood.² Figure 16.11 depicts the estimated probability distributions for two toy problems using iResNet. One application of density estimation is anomaly detection: the data distribution of a clean dataset is described using a normalizing flow model, and new examples with low probability are flagged as outliers. However, caution is required, as there may exist outliers with high probability that do not fall in the typical set (see figure 8.13).
16.5.2 Synthesis

Generative flows, or GLOW, is a normalizing flow model that can create high-fidelity images (figure 16.12) and uses many of the ideas from this chapter. It is most easily understood in the normalizing direction. GLOW starts with a 256×256×3 tensor containing an RGB image. It uses coupling layers, in which the channels are partitioned into two halves. The second half is subject to a different affine transform at each spatial position, where the parameters of the affine transformation are computed by a 2D convolutional neural network run on the other half of the channels. The coupling layers are alternated with 1×1 convolutions, parameterized as LU decompositions, which mix the channels. Periodically, the resolution is halved by combining each 2×2 patch into one position with four times as many channels. GLOW is a multi-scale flow, and some of the channels are periodically removed to become part of the latent vector z. Images are discrete (due to the quantization of RGB values), so noise is added to the inputs to prevent the training likelihood from increasing without bound. This is known as dequantization.

To sample more realistic images, the GLOW model samples from the base density raised to a positive power. This chooses examples that are closer to the center of the density rather than from the tails. This is similar to the truncation trick in GANs

²The lower bound on the likelihood for diffusion models can actually exceed the exact computation in normalizing flows, but data generation is much slower (see chapter 18).
Figure 16.12 Samples from GLOW trained on the CelebA HQ dataset (Karras et al., 2018). The samples are of reasonable quality, although GANs and diffusion models produce superior results. Adapted from Kingma & Dhariwal (2018).

Figure 16.13 Interpolation using the GLOW model. The left and right images are real people. The intermediate images were computed by projecting the real images to the latent space, interpolating, and then projecting the interpolated points back to image space. Adapted from Kingma & Dhariwal (2018).
(figure 15.10). Notably, the samples are not as good as those from GANs or diffusion models. It is unknown whether this is due to a fundamental restriction associated with invertible layers or merely because less research effort has been invested in this goal.

Figure 16.13 shows an example of interpolation using GLOW. Two latent vectors are computed by transforming two real images in the normalizing direction. Intermediate points between these latent vectors are computed by linear interpolation, and these are projected back to image space using the network in the generative direction. The result is a set of images that interpolates realistically between the two real ones.
16.5.3 Approximating other density models

Normalizing flows can also learn to generate samples that approximate an existing density that is easy to evaluate but difficult to sample from. In this context, we denote the normalizing flow Pr(x|ϕ) as the student and the target density q(x) as the teacher.

To make progress, we generate samples x_i = f[z_i, ϕ] from the student. Since we generated these samples ourselves, we know their corresponding latent variables z_i, and we can calculate their likelihood under the student model without inversion. Thus, we can use a model like a masked autoregressive flow, for which inversion is slow. We define a loss function based on the reverse KL divergence that encourages the student and teacher likelihoods to be identical and use this to train the student model (figure 16.14):

$$
\hat{\phi} = \underset{\phi}{\operatorname{argmin}}\left[ \mathrm{KL}\left[ \frac{1}{I}\sum_{i=1}^{I} \delta\big[x - f[z_i,\phi]\big] \;\Bigg\|\; q(x) \right]\right].
\tag{16.25}
$$

Problem 16.11
This approach contrasts with the typical use of normalizing flows, in which we build a probability model Pr(x|ϕ) of data {x_i} that came from an unknown distribution using maximum likelihood, which relies on the cross-entropy term from the forward KL divergence (section 5.7):

ϕ̂ = argmin_ϕ [ KL[ (1/I) Σ_{i=1}^{I} δ[x − x_i] ‖ Pr(x|ϕ) ] ].  (16.26)

Normalizing flows can model the posterior in VAEs using this trick (see chapter 17).
16.6 Summary
Normalizing ows transform a base distribution (usually a normal distribution) to create
a new density. They have the advantage that they can both evaluate the likelihood
of samples exactly and generate new samples. However, they have the architectural
constraint that each layer must be invertible; we need the forward transformation to
generate samples and the backward transformation to evaluate the likelihoods.
It’s also important that the Jacobian can be estimated eciently to evaluate the
likelihood; this must be done repeatedly to learn the density. However, invertible layers
Figure 16.14 Approximating density models. a) Training data. b) Usually, we modify the flow model parameters to minimize the KL divergence from the training data to the flow model. This is equivalent to maximum likelihood fitting (section 5.7). c) Alternatively, we can modify the flow parameters ϕ to minimize the KL divergence from the flow samples x_i = f[z_i, ϕ] to d) a target density.
are still useful in their own right even when the Jacobian cannot be estimated efficiently; they reduce the memory requirements of training a K-layer network from O[K] to O[1].
This chapter reviewed invertible network layers or flows. We considered linear flows and elementwise flows, which are simple but insufficiently expressive. Then we described more complex flows, such as coupling, autoregressive, and residual flows. Finally, we showed how normalizing flows can be used to estimate likelihoods, generate and interpolate between images, and approximate other distributions.
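The properties summarized above can be illustrated with a single toy coupling layer (a sketch in plain Python; the two tiny "networks" predicting scale and offset are arbitrary stand-ins, not from the book). The layer is invertible by construction, and its Jacobian is triangular, so the log-determinant needed for exact likelihood evaluation is trivial:

```python
import math

# One affine coupling layer on (h1, h2): h1 passes through unchanged and
# determines the scale/offset applied to h2.  Invertible by construction.
def forward(h1, h2):
    s = math.tanh(0.5 * h1)          # stand-in "network" for the log-scale
    t = 0.3 * h1                     # stand-in "network" for the offset
    return h1, h2 * math.exp(s) + t

def inverse(y1, y2):
    s = math.tanh(0.5 * y1)
    t = 0.3 * y1
    return y1, (y2 - t) * math.exp(-s)

def log_det_jacobian(h1, h2):
    # The Jacobian is triangular; its log-determinant is just the log-scale.
    return math.tanh(0.5 * h1)

def log_likelihood(x1, x2):
    # Change of variables: log Pr(x) = log Pr(z) - log|det J| at z = f^-1(x).
    z1, z2 = inverse(x1, x2)
    log_base = sum(-0.5 * z * z - 0.5 * math.log(2 * math.pi) for z in (z1, z2))
    return log_base - log_det_jacobian(z1, z2)

x1, x2 = forward(0.7, -1.2)          # generative direction: base sample -> data
print(inverse(x1, x2))               # normalizing direction recovers (0.7, -1.2)
print(log_likelihood(x1, x2))
```

Stacking many such layers (with the roles of h1 and h2 alternating) gives the expressive coupling flows discussed in this chapter, while keeping both directions and the log-determinant cheap.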
Notes
Normalizing ows were rst introduced by Rezende & Mohamed (2015) but had intellectual
antecedents in the work of Tabak & Vanden-Eijnden (2010), Tabak & Turner (2013), and
Rippel & Adams (2013). Reviews of normalizing ows can be found in Kobyzev et al. (2020)
and Papamakarios et al. (2021). Kobyzev et al. (2020) presented a quantitative comparison of
many normalizing ow approaches. They concluded that the Flow++ model (a coupling ow
with a novel elementwise transformation and other innovations) performed best at the time.
Invertible network layers: Invertible layers decrease the memory requirements of the backpropagation algorithm; the activations in the forward pass no longer need to be stored since they can be recomputed in the backward pass. In addition to the regular network layers and residual layers (Gomez et al., 2017; Jacobsen et al., 2018) discussed in this chapter, invertible layers have been developed for graph neural networks (Li et al., 2021a), recurrent neural networks (MacKay et al., 2018), masked convolutions (Song et al., 2019), U-Nets (Brügger et al., 2019; Etmann et al., 2020), and transformers (Mangalam et al., 2022).
Radial and planar ows: The original normalizing ows paper (Rezende & Mohamed, 2015)
used planar ows (which contract or expand the distribution along certain dimensions) and
radial ows (which expand or contract around a certain point). Inverses for these ows can’t
be computed easily, but they are useful for approximating distributions where sampling is slow
or where the likelihood can only be evaluated up to an unknown scaling factor (gure 16.14).
Applications: Applications include image generation (Ho et al., 2019; Kingma & Dhariwal, 2018), noise modeling (Abdelhamed et al., 2019), video generation (Kumar et al., 2019b), audio generation (Esling et al., 2019; Kim et al., 2018; Prenger et al., 2019), graph generation (Madhawa et al., 2019), image classification (Kim et al., 2021; Mackowiak et al., 2021), image steganography (Lu et al., 2021), super-resolution (Yu et al., 2020; Wolf et al., 2021; Liang et al., 2021), style transfer (An et al., 2021), motion style transfer (Wen et al., 2021), 3D shape modeling (Paschalidou et al., 2021), compression (Zhang et al., 2021b), sRGB to RAW image conversion (Xing et al., 2021), denoising (Liu et al., 2021b), anomaly detection (Yu et al., 2021), image-to-image translation (Ardizzone et al., 2020), synthesizing cell microscopy images under different molecular interventions (Yang et al., 2021), and light transport simulation (Müller et al., 2019b). For applications using image data, noise must be added before learning since the inputs are quantized and hence discrete (see Theis et al., 2016).
Rezende & Mohamed (2015) used normalizing flows to model the posterior in VAEs. Abdal et al. (2021) used normalizing flows to model the distribution of attributes in the latent space of StyleGAN and then used these distributions to change specified attributes in real images. Wolf et al. (2021) used normalizing flows to learn the conditional distribution of a noisy image given a clean one and hence simulate noisy data that can be used to train denoising or super-resolution models.
Normalizing ows have also found diverse uses in physics (Kanwar et al., 2020; Köhler et al.,
2020; Noé et al., 2019; Wirnsberger et al., 2020; Wong et al., 2020), natural language processing
(Tran et al., 2019; Ziegler & Rush, 2019; Zhou et al., 2019; He et al., 2018; Jin et al., 2019), and
reinforcement learning (Schroecker et al., 2019; Haarnoja et al., 2018a; Mazoure et al., 2020;
Ward et al., 2019; Touati et al., 2020).
Linear ows: Diagonal linear ows can represent normalization transformations like Batch-
Norm (Dinh et al., 2016) and ActNorm (Kingma & Dhariwal, 2018). Tomczak & Welling (2016)
investigated combining triangular matrices and using orthogonal transformations parameterized
by the Householder transform. Kingma & Dhariwal (2018) proposed the LU parameterization
described in section 16.5.2. Hoogeboom et al. (2019b) proposed using the QR decomposition
instead, which does not require predetermined permutation matrices. Convolutions are lin-
ear transformations (gure 10.4) that are widely used in deep learning, but their inverse and
determinant are not straightforward to compute. Kingma & Dhariwal (2018) used 1×1 con-
volutions, which is eectively a full linear transformation applied separately at each position.
Zheng et al. (2017) introduced ConvFlow, which was restricted to 1D convolutions. Hoogeboom
et al. (2019b) provided more general solutions for modeling 2D convolutions either by stacking
together masked autoregressive convolutions or by operating in the Fourier domain.
Elementwise ows and coupling functions: Elementwise ows transform each variable
independently using the same function (but with dierent parameters for each variable). The
same ows can be used to form the coupling functions in coupling and autoregressive ows, in
which case their parameters depend on the preceding variables. To be invertible, these functions
must be monotone.
An additive coupling function (Dinh et al., 2015) just adds an offset to the variable. Affine coupling functions scale the variable and add an offset and were used by Dinh et al. (2015), Dinh et al. (2016), Kingma & Dhariwal (2018), Kingma et al. (2016), and Papamakarios et al. (2017). Ziegler & Rush (2019) proposed the nonlinear squared flow, which is an invertible ratio of polynomials with five parameters. Continuous mixture CDFs (Ho et al., 2019) apply a monotone transformation based on the cumulative density function (CDF) of a mixture of K logistics, post-composed with an inverse logistic sigmoid, scaled, and offset.
The piecewise linear coupling function (figure 16.5) was developed by Müller et al. (2019b). Since then, systems based on cubic splines (Durkan et al., 2019a) and rational quadratic splines (Durkan et al., 2019b) have been proposed. Huang et al. (2018a) introduced neural autoregressive flows, in which the function is represented by a neural network that produces a monotonic function. A sufficient condition is that the weights are all positive and the activation functions are monotone. It is hard to train a network with the constraint that the weights are positive, so this led to unconstrained monotone neural networks (Wehenkel & Louppe, 2019), which model strictly positive functions and then integrate them numerically to get a monotone function. Jaini et al. (2019) constructed positive functions that can be integrated in closed form based on a classic result that all positive single-variable polynomials are sums of squares of polynomials. Finally, Dinh et al. (2019) investigated piecewise monotonic coupling functions.
Coupling ows: Dinh et al. (2015) introduced coupling ows in which the dimensions were
split in half (gure 16.6). Dinh et al. (2016) introduced RealNVP, which partitioned the image
input by taking alternating pixels or blocks of channels. Das et al. (2019) proposed selecting
features for the propagated part based on the magnitude of the derivatives. Dinh et al. (2016)
interpreted multi-scale ows (in which dimensions are gradually introduced) as coupling ows in
which the parameters ϕ have no dependence on the other half of the data. Kruse et al. (2021)
introduce a hierarchical formulation of coupling ows in which each partition is recursively
divided into two. GLOW (gures 16.12–16.13) was designed by Kingma & Dhariwal (2018) and
uses coupling ows, as do NICE (Dinh et al., 2015), RealNVP (Dinh et al., 2016), FloWaveNet
(Kim et al., 2018), WaveGlOW (Prenger et al., 2019), and Flow++ (Ho et al., 2019).
Autoregressive ows: Kingma et al. (2016) used autoregressive models for normalizing ows.
Germain et al. (2015) developed a general method for masking previous variables. This was
exploited by Papamakarios et al. (2017) to compute all of the outputs in the forward direction
simultaneously in masked autoregressive ows. Kingma et al. (2016) introduced the inverse
autoregressive ow. Parallel WaveNet (Van den Oord et al., 2018) distilled WaveNet (Van den
Oord et al., 2016a), which is a dierent type of generative model for audio, into an inverse
autoregressive ow so that sampling would be fast (see gure 16.14c–d).
Residual ows: Residual ows are based on residual networks (He et al., 2016a). RevNets
(Gomez et al., 2017) and iRevNets (Jacobsen et al., 2018) divide the input into two sections
(gure 16.8), each of which passes through a residual network. These networks are invertible,
but the determinant of the Jacobian cannot be computed easily. The residual connection can
be interpreted as the discretization of an ordinary dierential equation, and this perspective led
to dierent invertible architectures (Chang et al., 2018, 2019a). However, the Jacobian of these
networks could still not be computed eciently. Behrmann et al. (2019) noted that the network
can be inverted using xed point iterations if its Lipschitz constant is less than one. This led to
iResNet, in which the log determinant of the Jacobian can be estimated using Hutchinson’s trace
Draft: please send errata to udlbookmail@gmail.com.
324 16 Normalizing ows
estimator (Hutchinson, 1989). Chen et al. (2019) removed the bias induced by the truncation
of the power series in equation 16.22 by using the Russian Roulette estimator.
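Hutchinson's trace estimator mentioned here relies on the identity trace[A] = E[εᵀAε] for random vectors ε with E[εεᵀ] = I. A quick numerical check in plain Python with Rademacher probe vectors (the matrix itself is arbitrary):

```python
import random

random.seed(1)

A = [[4.0, 1.0, 0.0],
     [2.0, -3.0, 0.5],
     [1.0, 0.0, 2.5]]          # true trace = 4 - 3 + 2.5 = 3.5

def hutchinson_trace(A, num_probes=20000):
    n = len(A)
    total = 0.0
    for _ in range(num_probes):
        # Rademacher probe: entries are +1 or -1 with equal probability.
        eps = [random.choice((-1.0, 1.0)) for _ in range(n)]
        A_eps = [sum(A[i][j] * eps[j] for j in range(n)) for i in range(n)]
        total += sum(e * ae for e, ae in zip(eps, A_eps))  # eps^T A eps
    return total / num_probes

print(hutchinson_trace(A))      # close to 3.5
```

The appeal in the flow setting is that the estimator only needs matrix–vector products with the Jacobian, which can be computed by automatic differentiation without ever forming the Jacobian explicitly.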
Innitesimal ows: If residual networks can be viewed as a discretization of an ordinary
dierential equation (ODE), then the next logical step is to represent the change in the variables
directly by an ODE. The neural ODE was explored by Chen et al. (2018e) and exploits standard
methods for forward and backward propagation in ODEs. The Jacobian is no longer required
to compute the likelihood; this is represented by a dierent ODE in which the change in log
probability is related to the trace of the derivative of the forward propagation. Grathwohl
et al. (2019) used the Hutchinson estimator to estimate the trace and simplied this further.
Finlay et al. (2020) added regularization terms to the loss function that make training easier,
and Dupont et al. (2019) augmented the representation to allow the neural ODE to represent
a broader class of dieomorphisms. Tzen & Raginsky (2019) and Peluchetti & Favaro (2020)
replaced the ODEs with stochastic dierential equations.
Universality: The universality property refers to the ability of a normalizing flow to model any probability distribution arbitrarily well. Some flows (e.g., planar, elementwise) do not have this property. Autoregressive flows can be shown to have the universality property when the coupling function is a neural monotone network (Huang et al., 2018a), based on monotone polynomials (Jaini et al., 2020), or based on splines (Kobyzev et al., 2020). For dimension D, a series of D coupling flows can form an autoregressive flow. To understand why, note that the partitioning into two parts h_1 and h_2 means that at any given layer h_2 depends only on the previous variables (figure 16.6). Hence, if we increase the size of h_1 by one at every layer, we can reproduce an autoregressive flow, and the result is universal. It is not known whether coupling flows can be universal with fewer than D layers. However, they work well in practice (e.g., GLOW) without the need for this induced autoregressive structure.
Other work: Active areas of research in normalizing flows include the investigation of discrete flows (Hoogeboom et al., 2019a; Tran et al., 2019), normalizing flows on non-Euclidean manifolds (Gemici et al., 2016; Wang & Wang, 2019), and equivariant flows (Köhler et al., 2020; Rezende et al., 2019), which aim to create densities that are invariant to families of transformations.
Problems
Problem 16.1 Consider transforming a uniform base density defined on z ∈ [0, 1] using the function x = f[z] = z². Find an expression for the transformed distribution Pr(x).
Problem 16.2 Consider transforming a standard normal distribution:

Pr(z) = (1/√(2π)) · exp[−z²/2],  (16.27)

with the function:

x = f[z] = 1 / (1 + exp[−z]).  (16.28)

Find an expression for the transformed distribution Pr(x).
Problem 16.3 Write expressions for the Jacobian of the inverse mapping z = f⁻¹[x, ϕ] and the absolute determinant of that Jacobian in forms similar to equations 16.6 and 16.7.
Problem 16.4 Compute the inverse and the determinant of the following matrices by hand:

Ω₁ = [ 2 0 0 0
       0 5 0 0
       0 0 1 0
       0 0 0 2 ],

Ω₂ = [ 1 0 0 0
       2 4 0 0
       1 1 2 0
       4 2 2 1 ].  (16.29)
Problem 16.5 Consider a random variable z with mean µ and covariance Σ that is transformed as x = Az + b. Show that the expected value of x is Aµ + b and that the covariance of x is AΣAᵀ.
Problem 16.6 Prove that if x = f[z] = Az + b and Pr(z) = Norm_z[µ, Σ], then Pr(x) = Norm_x[Aµ + b, AΣAᵀ] using the relation:

Pr(x) = Pr(z) · |∂f[z]/∂z|⁻¹.  (16.30)
Problem 16.7 The Leaky ReLU is defined as:

LReLU[z] = { 0.1z   z < 0
           { z      z ≥ 0.   (16.31)

Write an expression for the inverse of the leaky ReLU. Write an expression for the inverse absolute determinant of the Jacobian |∂f[z]/∂z|⁻¹ for an elementwise transformation x = f[z] of the multivariate variable z where:

f[z] = [LReLU[z₁], LReLU[z₂], …, LReLU[z_D]]ᵀ.  (16.32)
Problem 16.8 Consider applying the piecewise linear function f[h, ϕ] defined in equation 16.12 for the domain h ∈ [0, 1] elementwise to an input h = [h₁, h₂, …, h_D]ᵀ so that f[h] = [f[h₁, ϕ], f[h₂, ϕ], …, f[h_D, ϕ]]ᵀ. What is the Jacobian ∂f[h]/∂h? What is the determinant of the Jacobian?
Problem 16.9 Consider constructing an elementwise flow based on a conical combination of square root functions in equally spaced bins:

h′ = f[h, ϕ] = √((Kh − b + 1) · ϕ_b) + Σ_{k=1}^{b−1} √(ϕ_k),  (16.33)

where b = ⌈Kh⌉ is the bin that h falls into, and the parameters ϕ_k are positive and sum to one. Consider the case where K = 5 and ϕ₁ = 0.1, ϕ₂ = 0.2, ϕ₃ = 0.5, ϕ₄ = 0.1, ϕ₅ = 0.1. Draw the function f[h, ϕ]. Draw the inverse function f⁻¹[h′, ϕ].
Problem 16.10 Draw the structure of the Jacobian (indicating which elements are zero) for the forward mapping of the residual flow in figure 16.8 for the cases where f₁[•, ϕ₁] and f₂[•, ϕ₂] are (i) a fully connected neural network, (ii) an elementwise flow.
Problem 16.11 Write out the expression for the KL divergence in equation 16.25. Why does it not matter if we can only evaluate the probability q(x) up to a scaling factor κ? Does the network have to be invertible to minimize this loss function? Explain your reasoning.
Chapter 17
Variational autoencoders
Generative adversarial networks learn a mechanism for creating samples that are statistically indistinguishable from the training data {x_i}. In contrast, like normalizing flows, variational autoencoders, or VAEs, are probabilistic generative models; they aim to learn a distribution Pr(x) over the data (see figure 14.2). After training, it is possible to draw (generate) samples from this distribution. However, the properties of the VAE mean that it is unfortunately not possible to evaluate the probability of new examples x∗ exactly.
It is common to talk about the VAE as if it is the model of Pr(x), but this is misleading; the VAE is a neural architecture that is designed to help learn the model for Pr(x). The final model for Pr(x) contains neither the "variational" nor the "autoencoder" parts and might be better described as a nonlinear latent variable model.
This chapter starts by introducing latent variable models in general and then considers the specific case of the nonlinear latent variable model. It will become clear that maximum likelihood learning of this model is not straightforward. Nevertheless, it is possible to define a lower bound on the likelihood, and the VAE architecture approximates this bound using a Monte Carlo (sampling) method. The chapter concludes by presenting several applications of the VAE.
17.1 Latent variable models
Latent variable models take an indirect approach to describing a probability distribution Pr(x) over a multi-dimensional variable x. Instead of directly writing the expression for Pr(x), they model a joint distribution Pr(x, z) of the data x and an unobserved hidden or latent variable z (see Appendix C.1.2, Marginalization). They then describe the probability Pr(x) as a marginalization of this joint probability so that:

Pr(x) = ∫ Pr(x, z) dz.  (17.1)
Typically, the joint probability Pr(x, z) is broken down using the rules of conditional probability (Appendix C.1.3) into the likelihood of the data with respect to the latent variables, Pr(x|z), and the prior Pr(z):
Pr(x) = ∫ Pr(x|z) Pr(z) dz.  (17.2)
This is a rather indirect approach to describing Pr(x), but it is useful because relatively simple expressions for Pr(x|z) and Pr(z) can define complex distributions Pr(x).
17.1.1 Example: mixture of Gaussians
In a 1D mixture of Gaussians (figure 17.1a), the latent variable z is discrete, and the prior Pr(z) is a categorical distribution (figure 5.9) with one probability λ_n for each possible value of z (Problem 17.1). The likelihood Pr(x|z = n) of the data x given that the latent variable z takes value n is normally distributed with mean µ_n and variance σ²_n:

Pr(z = n) = λ_n
Pr(x|z = n) = Norm_x[µ_n, σ²_n].  (17.3)
As in equation 17.2, the likelihood Pr(x) is given by the marginalization over the latent variable z (figure 17.1b). Here, the latent variable is discrete, so we sum over its possible values to marginalize:

Pr(x) = Σ_{n=1}^{N} Pr(x, z = n)
      = Σ_{n=1}^{N} Pr(x|z = n) · Pr(z = n)
      = Σ_{n=1}^{N} λ_n · Norm_x[µ_n, σ²_n].  (17.4)
From simple expressions for the likelihood and prior, we describe a complex multi-modal
probability distribution.
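The marginalization in equation 17.4 is easy to check numerically. This sketch (plain Python, with made-up weights, means, and variances) evaluates Pr(x) as a weighted sum of Gaussian components and verifies that the resulting density integrates to one:

```python
import math

# Hypothetical MoG parameters: weights lam_n, means mu_n, variances sig2_n.
lam  = [0.3, 0.5, 0.2]
mu   = [-2.0, 0.5, 3.0]
sig2 = [0.5, 1.0, 0.8]

def norm_pdf(x, m, s2):
    return math.exp(-0.5 * (x - m) ** 2 / s2) / math.sqrt(2 * math.pi * s2)

def pr_x(x):
    # Equation 17.4: Pr(x) = sum_n lam_n * Norm_x[mu_n, sig2_n]
    return sum(l * norm_pdf(x, m, s2) for l, m, s2 in zip(lam, mu, sig2))

# Riemann-sum check that the marginal density integrates to ~1.
dx = 0.01
total = sum(pr_x(-10 + i * dx) * dx for i in range(2000))
print(total)    # approximately 1.0
```

Because the weights λ_n sum to one and each component integrates to one, the marginal is automatically a valid density, however complex and multi-modal its shape.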
17.2 Nonlinear latent variable model
In the nonlinear latent variable model, both the data x and the latent variable z are continuous and multivariate (Appendix C.3.2). The prior Pr(z) is a standard multivariate normal:

Pr(z) = Norm_z[0, I].  (17.5)

The likelihood Pr(x|z, ϕ) is also normally distributed; its mean is a nonlinear function f[z, ϕ] of the latent variable, and its covariance σ²I is spherical:
Figure 17.1 Mixture of Gaussians (MoG). a) The MoG describes a complex probability distribution (cyan curve) as a weighted sum of Gaussian components (dashed curves). b) This sum is the marginalization of the joint density Pr(x, z) between the continuous observed data x and a discrete latent variable z.
Pr(x|z, ϕ) = Norm_x[f[z, ϕ], σ²I].  (17.6)

The function f[z, ϕ] is described by a deep network with parameters ϕ. The latent variable z is lower dimensional than the data x. The model f[z, ϕ] describes the important aspects of the data, and the remaining unmodeled aspects are ascribed to the noise σ²I (see Notebook 17.1, Latent variable models).
The data probability Pr(x|ϕ) is found by marginalizing over the latent variable z:

Pr(x|ϕ) = ∫ Pr(x, z|ϕ) dz
        = ∫ Pr(x|z, ϕ) · Pr(z) dz
        = ∫ Norm_x[f[z, ϕ], σ²I] · Norm_z[0, I] dz.  (17.7)

This can be viewed as an infinite weighted sum (i.e., an infinite mixture) of spherical Gaussians with different means, where the weights are Pr(z) and the means are the network outputs f[z, ϕ] (figure 17.2).
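Although equation 17.7 has no closed form for a general network f[z, ϕ], it can be approximated by Monte Carlo: draw samples z_s from the prior and average the likelihood terms Norm_x[f[z_s, ϕ], σ²I]. As a sanity check (plain Python, toy one-dimensional setup), when f[z] = z the integral is available exactly as Pr(x) = Norm_x[0, 1 + σ²], and the estimate should agree with it:

```python
import math, random

random.seed(2)
sigma2 = 0.25            # fixed likelihood variance sigma^2

def norm_pdf(x, m, s2):
    return math.exp(-0.5 * (x - m) ** 2 / s2) / math.sqrt(2 * math.pi * s2)

def mc_likelihood(x, f, num_samples=200000):
    # Pr(x|phi) ~= (1/S) * sum_s Norm_x[f[z_s], sigma^2],  z_s ~ Norm[0, 1]
    total = 0.0
    for _ in range(num_samples):
        z = random.gauss(0.0, 1.0)
        total += norm_pdf(x, f(z), sigma2)
    return total / num_samples

x = 0.8
estimate = mc_likelihood(x, f=lambda z: z)    # identity stand-in "network"
exact = norm_pdf(x, 0.0, 1.0 + sigma2)        # closed form for this special case
print(estimate, exact)
```

This naive estimator already hints at the trouble ahead: for high-dimensional x, almost all prior samples z_s produce negligible likelihood terms, which is why the chapter develops the ELBO instead.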
17.2.1 Generation
A new example x∗ can be generated using ancestral sampling (figure 17.3; see Appendix C.4.2). We draw z∗ from the prior Pr(z) and pass this through the network f[z∗, ϕ] to compute the mean of the likelihood Pr(x|z∗, ϕ) (equation 17.6), from which we draw x∗. Both the prior and likelihood are normal distributions, so this is straightforward.
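The two-step ancestral sampling procedure is a direct transcription into code (plain Python; the "network" f below is an arbitrary stand-in nonlinearity, not from the book):

```python
import math, random

random.seed(3)
sigma = 0.3                       # fixed noise standard deviation

def f(z):                         # stand-in for the deep network f[z, phi]
    return math.tanh(z) + 0.1 * z

def sample():
    z_star = random.gauss(0.0, 1.0)            # 1) draw z* from the prior
    return random.gauss(f(z_star), sigma)      # 2) draw x* from Pr(x|z*, phi)

samples = [sample() for _ in range(50000)]
print(sum(samples) / len(samples))   # near 0 here, since this f is odd
```

Repeating this process many times traces out the marginal density Pr(x|ϕ), exactly as in figure 17.3c.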
Figure 17.2 Nonlinear latent variable model. A complex 2D density Pr(x) (right) is created as the marginalization of the joint distribution Pr(x, z) (left) over the latent variable z; to create Pr(x), we integrate the 3D volume over the dimension z. For each z, the distribution over x is a spherical Gaussian (two slices shown) with a mean f[z, ϕ] that is a nonlinear function of z and depends on parameters ϕ. The distribution Pr(x) is a weighted sum of these Gaussians.
Figure 17.3 Generation from the nonlinear latent variable model. a) We draw a sample z∗ from the prior probability Pr(z) over the latent variable. b) A sample x∗ is then drawn from Pr(x|z∗, ϕ). This is a spherical Gaussian with a mean that is a nonlinear function f[•, ϕ] of z∗ and a fixed variance σ²I. c) If we repeat this process many times, we recover the density Pr(x|ϕ).
Figure 17.4 Jensen’s inequality (discrete
case). The logarithm (black curve) is
a concave function; you can draw a
straight line between any two points on
the curve, and this line will always lie un-
derneath it. It follows that any convex
combination (weighted sum with posi-
tive weights that sum to one) of the six
points on the log function must lie in
the gray region under the curve. Here,
we have weighted the points equally (i.e.,
taken the mean) to yield the cyan point.
Since this point lies below the curve,
log[E[y]] > E[log[y]].
17.3 Training
To train the model, we maximize the log-likelihood over a training dataset {x_i}_{i=1}^{I} with respect to the model parameters. For simplicity, we assume that the variance term σ² in the likelihood expression is known and concentrate on learning ϕ:
ϕ̂ = argmax_ϕ [ Σ_{i=1}^{I} log[Pr(x_i|ϕ)] ],  (17.8)

where:

Pr(x_i|ϕ) = ∫ Norm_{x_i}[f[z, ϕ], σ²I] · Norm_z[0, I] dz.  (17.9)
Unfortunately, this is intractable. There is no closed-form expression for the integral and
no easy way to evaluate it for a particular value of x.
17.3.1 Evidence lower bound (ELBO)
To make progress, we dene a lower bound on the log-likelihood. This is a function that is
always less than or equal to the log-likelihood for a given value of ϕ and will also depend
on some other parameters θ. Eventually, we will build a network to compute this lower
bound and optimize it. To dene this lower bound, we need Jensen’s inequality.
17.3.2 Jensen’s inequality
Jensen’s inequality says that a concave function g[] of the expectation of data y is
greater than or equal to the expectation of the function of the data:
Appendix B.1.2
Concave functions
Figure 17.5 Jensen’s inequality (continuous case). For a concave function, com-
puting the expectation of a distribution Pr(y) and passing it through the function
gives a result greater than or equal to transforming the variable y by the function
and then computing the expectation of the new variable. In the case of the loga-
rithm, we have log[E[y]] E[log[y]]. The left-hand side of the gure corresponds
to the left-hand side of this inequality and the right-hand side of the gure to
the right-hand side. One way of thinking about this is to consider that we are
taking a convex combination of the points in the orange distribution dened over
y [0, 1]. By the logic of gure 17.4, this must lie under the curve. Alternatively,
we can think about the concave function as compressing the high values of y
relative to the low values, so the expected value is lower when we pass y through
the function rst.
g[E[y]] ≥ E[g[y]].  (17.10)

In this case, the concave function is the logarithm, so we have (Problems 17.2–17.3):

log[E[y]] ≥ E[log[y]],  (17.11)

or writing out the expression for the expectation in full, we have:

log[ ∫ Pr(y) · y dy ] ≥ ∫ Pr(y) · log[y] dy.  (17.12)

This is explored in figures 17.4–17.5. In fact, the slightly more general statement is true:

log[ ∫ Pr(y) · h[y] dy ] ≥ ∫ Pr(y) · log[h[y]] dy,  (17.13)

where h[y] is a function of y. This follows because h[y] is another random variable with a new distribution. Since we never specified Pr(y), the relation remains true.
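A quick numerical check of equation 17.11 in plain Python: for any positive random variable y (here an arbitrary choice of shifted squared Gaussians), the log of the mean is at least the mean of the logs:

```python
import math, random

random.seed(4)

# Positive random variable y (an arbitrary choice for illustration).
y = [random.gauss(1.0, 0.5) ** 2 + 0.1 for _ in range(10000)]

log_of_mean = math.log(sum(y) / len(y))                  # log[E[y]]
mean_of_log = sum(math.log(v) for v in y) / len(y)       # E[log[y]]

# Jensen's inequality for the concave logarithm: log[E[y]] >= E[log[y]].
print(log_of_mean, mean_of_log)
```

The gap between the two quantities shrinks as the distribution of y concentrates, with equality in the limit where y is constant.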
Figure 17.6 Evidence lower bound (ELBO). The goal is to maximize the log-likelihood log[Pr(x|ϕ)] (black curve) with respect to the parameters ϕ. The ELBO is a function that lies everywhere below the log-likelihood. It is a function of both ϕ and a second set of parameters θ. For fixed θ, we get a function of ϕ (two colored curves for different values of θ). Consequently, we can increase the log-likelihood by either improving the ELBO with respect to a) the new parameters θ (moving from colored curve to colored curve) or b) the original parameters ϕ (moving along the current colored curve).
17.3.3 Deriving the bound
We now use Jensen’s inequality to derive the lower bound for the log-likelihood. We
start by multiplying and dividing the log-likelihood by an arbitrary probability distribu-
tion q(z) over the latent variables:
log[P r(x|ϕ)] = log
Z
P r(x, z|ϕ)dz
= log
Z
q(z)
P r(x, z|ϕ)
q(z)
dz
, (17.14)
We then use Jensen’s inequality for the logarithm (equation 17.12) to nd a lower bound:
log
Z
q(z)
P r(x, z|ϕ)
q(z)
dz
Z
q(z) log
P r(x, z|ϕ)
q(z)
dz, (17.15)
where the right-hand side is termed the evidence lower bound or ELBO. It gets this name
because P r(x|ϕ) is called the evidence in the context of Bayes’ rule (equation 17.19).
In practice, the distribution q(z) has parameters θ, so the ELBO can be written as:
ELBO[θ, ϕ] =
Z
q(z|θ) log
P r(x, z|ϕ)
q(z|θ)
dz. (17.16)
To learn the nonlinear latent variable model, we maximize this quantity as a function of
both ϕ and θ. The neural architecture that computes this quantity is the VAE.
17.4 ELBO properties
When rst encountered, the ELBO is a somewhat mysterious object, so we now provide
some intuition about its properties. Consider that the original log-likelihood of the data
is a function of the parameters ϕ and that we want to nd its maximum. For any xed θ,
the ELBO is still a function of the parameters but one that must lie below the original
likelihood function. When we change θ, we modify this function, and depending on our
choice, the lower bound may move closer or further from the log-likelihood. When we
change ϕ, we move along the lower bound function (gure 17.6).
17.4.1 Tightness of bound
The ELBO is tight when, for a fixed value of ϕ, the ELBO and the likelihood function coincide. To find the distribution q(z|θ) that makes the bound tight, we factor the numerator of the log term in the ELBO using the definition of conditional probability (Appendix C.1.3):
ELBO[θ, ϕ] = ∫ q(z|θ) · log[ Pr(x, z|ϕ) / q(z|θ) ] dz
           = ∫ q(z|θ) · log[ Pr(z|x, ϕ) · Pr(x|ϕ) / q(z|θ) ] dz
           = ∫ q(z|θ) · log[ Pr(x|ϕ) ] dz + ∫ q(z|θ) · log[ Pr(z|x, ϕ) / q(z|θ) ] dz
           = log[ Pr(x|ϕ) ] + ∫ q(z|θ) · log[ Pr(z|x, ϕ) / q(z|θ) ] dz
           = log[ Pr(x|ϕ) ] − D_KL[ q(z|θ) ‖ Pr(z|x, ϕ) ].  (17.17)

Here, the first integral disappears between lines three and four since log[Pr(x|ϕ)] does not depend on z, and the integral of the probability distribution q(z|θ) is one. In the last line, we have just used the definition of the Kullback–Leibler (KL) divergence (Appendix C.5.1).
This equation shows that the ELBO is the original log-likelihood minus the KL divergence D_KL[q(z|θ) ‖ Pr(z|x, ϕ)]. The KL divergence measures the "distance" between distributions and can only take non-negative values. It follows that the ELBO is a lower bound on log[Pr(x|ϕ)]. The KL distance will be zero, and the bound will be tight, when q(z|θ) = Pr(z|x, ϕ). This is the posterior distribution over the latent variables z given the observed data x; it indicates which values of the latent variable could have been responsible for the data point (figure 17.7).
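Equation 17.17 can be verified numerically for a one-dimensional toy model (plain Python; the decoder f and all constants are invented for illustration). We compute log Pr(x|ϕ) by numerical integration, form the posterior Pr(z|x, ϕ) on the same grid via Bayes' rule, and evaluate the ELBO of equation 17.16 with q set to that posterior; the bound is then tight:

```python
import math

sigma2 = 0.25                         # likelihood variance
def f(z): return math.tanh(z)         # stand-in decoder network

def norm_pdf(v, m, s2):
    return math.exp(-0.5 * (v - m) ** 2 / s2) / math.sqrt(2 * math.pi * s2)

x = 0.4
dz = 0.001
zs = [-6 + dz * i for i in range(12001)]             # integration grid

# Evidence: Pr(x) = integral of Pr(x|z) Pr(z) dz  (equation 17.7).
joint = [norm_pdf(x, f(z), sigma2) * norm_pdf(z, 0, 1) for z in zs]
pr_x = sum(joint) * dz
log_pr_x = math.log(pr_x)

# Posterior Pr(z|x) = joint / evidence  (Bayes' rule, equation 17.19).
post = [j / pr_x for j in joint]

# ELBO with q(z) set to the posterior: integral of q * log[joint / q] dz.
elbo = sum(q * math.log(j / q) for q, j in zip(post, joint) if q > 0) * dz
print(log_pr_x, elbo)    # the two agree: the bound is tight at the posterior
```

Substituting any other distribution for `post` in the last step produces a strictly smaller value, which is the content of the KL term in equation 17.17.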
Figure 17.7 Posterior distribution over the latent variable. a) The posterior distribution Pr(z|x∗, ϕ) is the distribution over the values of the latent variable z that could be responsible for a data point x∗. We calculate this via Bayes' rule Pr(z|x∗, ϕ) ∝ Pr(x∗|z, ϕ) · Pr(z). b) We compute the first term on the right-hand side (the likelihood) by assessing the probability of x∗ against the symmetric Gaussian associated with each value of z. Here, it was more likely to have been created from z₁ than z₂. The second term is the prior probability Pr(z) over the latent variable. Combining these two factors and normalizing so the distribution sums to one gives us the posterior Pr(z|x∗, ϕ).
17.4.2 ELBO as reconstruction loss minus KL distance to prior
Equations 17.16 and 17.17 are two different ways to express the ELBO. A third way is to consider the bound as the reconstruction error minus the distance to the prior:
ELBO[θ, ϕ] = ∫ q(z|θ) · log[ Pr(x, z|ϕ) / q(z|θ) ] dz
           = ∫ q(z|θ) · log[ Pr(x|z, ϕ) · Pr(z) / q(z|θ) ] dz
           = ∫ q(z|θ) · log[ Pr(x|z, ϕ) ] dz + ∫ q(z|θ) · log[ Pr(z) / q(z|θ) ] dz
           = ∫ q(z|θ) · log[ Pr(x|z, ϕ) ] dz − D_KL[ q(z|θ) ‖ Pr(z) ],  (17.18)

where the joint distribution Pr(x, z|ϕ) has been factored into the conditional probability Pr(x|z, ϕ) · Pr(z) between the first and second lines, and the definition of KL divergence is used again in the last line (Problem 17.4).
This work is subject to a Creative Commons CC-BY-NC-ND license. (C) MIT Press.
In this formulation, the first term measures the average agreement Pr(x|z, ϕ) of the
latent variable and the data. This is termed the reconstruction loss. The second term
measures the degree to which the auxiliary distribution q(z|θ) matches the prior. This
formulation is the one that is used in the variational autoencoder.
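These equivalent expressions can be checked numerically. The following sketch uses a discrete latent variable z with hand-picked (hypothetical) probabilities, so every integral becomes a finite sum; it verifies that the definition of the ELBO, the log-likelihood-minus-KL form, and the reconstruction-minus-KL form of equation 17.18 all agree:

```python
import math

# Toy model with a discrete latent z in {0, 1} so every quantity is exact.
# All numbers (p_z, p_x_given_z, q) are hypothetical illustration choices.
p_z = [0.5, 0.5]                # prior Pr(z)
p_x_given_z = [0.2, 0.7]        # likelihood Pr(x|z) for one fixed data point x
q = [0.3, 0.7]                  # auxiliary distribution q(z)

# Evidence Pr(x) = sum_z Pr(x|z) Pr(z)
p_x = sum(px * pz for px, pz in zip(p_x_given_z, p_z))
# Posterior Pr(z|x) via Bayes' rule
post = [px * pz / p_x for px, pz in zip(p_x_given_z, p_z)]

def kl(a, b):
    """KL divergence between two discrete distributions."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b))

# ELBO, three equivalent forms:
elbo_def = sum(qi * math.log(pxz * pz / qi)
               for qi, pxz, pz in zip(q, p_x_given_z, p_z))
elbo_posterior_form = math.log(p_x) - kl(q, post)   # log-likelihood minus KL to posterior
elbo_prior_form = (sum(qi * math.log(pxz) for qi, pxz in zip(q, p_x_given_z))
                   - kl(q, p_z))                    # reconstruction minus KL to prior

assert abs(elbo_def - elbo_posterior_form) < 1e-12
assert abs(elbo_def - elbo_prior_form) < 1e-12
assert elbo_def <= math.log(p_x)    # the ELBO is a lower bound on the log evidence
```

Because the KL divergence to the posterior is non-negative, the last assertion confirms the lower-bound property for this example.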
17.5 Variational approximation
We saw in equation 17.17 that the ELBO is tight when q(z|θ) is the posterior Pr(z|x, ϕ).
In principle, we can compute the posterior using Bayes’ rule:
Pr(z|x, ϕ) = Pr(x|z, ϕ)Pr(z) / Pr(x|ϕ),          (17.19)
but in practice, this is intractable because we can't evaluate the evidence term Pr(x|ϕ)
in the denominator (see section 17.3).
One solution is to make a variational approximation: we choose a simple parametric
form for q(z|θ) and use this to approximate the true posterior. Here, we choose a
multivariate normal distribution with mean µ and diagonal covariance Σ (see Appendix
C.3.2). This will not always match the posterior well but will be better for some values
of µ and Σ than others. During training, we will find the normal distribution that is
"closest" to the true posterior Pr(z|x) (figure 17.8). This corresponds to minimizing
the KL divergence in equation 17.17 and moving the colored curves in figure 17.6
upwards.
Since the optimal choice for q(z|θ) was the posterior Pr(z|x), and this depends on
the data example x, the variational approximation should do the same, so we choose:
q(z|x, θ) = Norm_z[ g_µ[x, θ], g_Σ[x, θ] ],          (17.20)
where g[x, θ] is a second neural network with parameters θ that predicts the mean µ
and variance Σ of the normal variational approximation.
17.6 The variational autoencoder
Finally, we can describe the VAE. We build a network that computes the ELBO:
ELBO[θ, ϕ] = ∫ q(z|x, θ) log[Pr(x|z, ϕ)] dz - D_KL[ q(z|x, θ) || Pr(z) ],          (17.21)
where the distribution q(z|x, θ) is the approximation from equation 17.20.
The rst term still involves an intractable integral, but since it is an expectation with
Appendix C.2
Expectation
respect to q(z|x, θ), we can approximate it by sampling. For any function a[] we have:
Figure 17.8 Variational approximation. The posterior Pr(z|x, ϕ) can't be computed
in closed form. The variational approximation chooses a family of distributions
q(z|x, θ) (here Gaussians) and tries to find the closest member of this family to the
true posterior. a) Sometimes, the approximation (cyan curve) is good and lies close
to the true posterior (orange curve). b) However, if the posterior is multi-modal (as
in figure 17.7), then the Gaussian approximation will be poor.
E_z[ a[z] ] = ∫ a[z] q(z|x, θ) dz ≈ (1/N) Σ_{n=1}^{N} a[z_n],          (17.22)
where z_n is the n-th sample from q(z|x, θ). This is known as a Monte Carlo estimate.
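As a minimal illustration of such a Monte Carlo estimate, the sketch below uses a hypothetical test function a[z] = z² and a 1D Gaussian standing in for q(z|x, θ), for which the expectation has a known closed form:

```python
import math, random

random.seed(0)

# z ~ N(mu, sigma^2) stands in for q(z|x, theta); a[z] = z^2 is a test function
# whose expectation is known exactly: E[z^2] = sigma^2 + mu^2.
mu, sigma = 1.0, 0.5
a = lambda z: z * z
exact = sigma ** 2 + mu ** 2          # = 1.25

# Monte Carlo estimate (equation 17.22)
N = 100_000
estimate = sum(a(random.gauss(mu, sigma)) for _ in range(N)) / N
assert abs(estimate - exact) < 0.05
```

With a single sample (N = 1), the same estimator gives the very rough approximation used in equation 17.23.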
For a very approximate estimate, we can just use a single sample z drawn from q(z|x, θ):
ELBO[θ, ϕ] ≈ log[Pr(x|z, ϕ)] - D_KL[ q(z|x, θ) || Pr(z) ].          (17.23)
The second term is the KL divergence between the variational distribution q(z|x, θ) =
Norm_z[µ, Σ] and the prior Pr(z) = Norm_z[0, I]. The KL divergence between two
normal distributions can be calculated in closed form (see Appendix C.5.4). For the
special case where one distribution has parameters µ, Σ and the other is a standard
normal, it is given by:
D_KL[ q(z|x, θ) || Pr(z) ] = (1/2) ( Tr[Σ] + µᵀµ - D_z - log[det[Σ]] ),          (17.24)
where D_z is the dimensionality of the latent space.
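A quick numerical check of this closed-form expression, assuming a 2D latent space with hand-picked µ and diagonal Σ, against a Monte Carlo average of log q(z) − log Pr(z):

```python
import math, random

random.seed(1)

# Hypothetical variational parameters: mean mu and diagonal covariance sig2.
mu = [0.5, -1.0]
sig2 = [0.8, 1.5]
Dz = len(mu)

# Closed form (equation 17.24) for a diagonal Sigma
kl_closed = 0.5 * (sum(sig2) + sum(m * m for m in mu)
                   - Dz - sum(math.log(s) for s in sig2))

def log_norm(z, mean, var):
    """Log density of a diagonal Gaussian."""
    return sum(-0.5 * math.log(2 * math.pi * v) - (zi - m) ** 2 / (2 * v)
               for zi, m, v in zip(z, mean, var))

# Monte Carlo: KL = E_q[ log q(z) - log p(z) ] with p(z) = N(0, I)
N = 200_000
total = 0.0
for _ in range(N):
    z = [random.gauss(m, math.sqrt(v)) for m, v in zip(mu, sig2)]
    total += log_norm(z, mu, sig2) - log_norm(z, [0, 0], [1, 1])
kl_mc = total / N
assert abs(kl_mc - kl_closed) < 0.02
```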
17.6.1 VAE algorithm
To summarize, we aim to build a model that computes the evidence lower bound for a
point x. Then we use an optimization algorithm to maximize this lower bound over the
Figure 17.9 Variational autoencoder. The encoder g[x, θ] takes a training example
x and predicts the parameters µ, Σ of the variational distribution q(z|x, θ). We
sample from this distribution and then use the decoder f[z, ϕ] to predict the data x.
The loss function is the negative ELBO, which depends on how accurate this prediction
is and how similar the variational distribution q(z|x, θ) is to the prior Pr(z)
(equation 17.21).
dataset and hence improve the log-likelihood. To compute the ELBO we:
• compute the mean µ and variance Σ of the variational posterior distribution q(z|θ, x)
  for this data point x using the network g[x, θ],
• draw a sample z from this distribution, and
• compute the ELBO using equation 17.23.
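The steps above can be sketched numerically. Here the encoder g and decoder f are hypothetical stand-in functions rather than trained networks, the latent space is one-dimensional, and the observation model is Norm_x[f[z], σ²]:

```python
import math, random

random.seed(2)

# Hypothetical stand-ins for the trained networks; a real VAE would learn these.
def g(x):                      # encoder: predicts mu, sigma^2 of q(z|x, theta)
    return 0.5 * x, 0.4

def f(z):                      # decoder: predicts the mean of Pr(x|z, phi)
    return 2.0 * z

sigma_x2 = 1.0                 # fixed observation noise variance

def elbo_single_sample(x):
    mu, s2 = g(x)                                   # step 1: encode
    z = mu + math.sqrt(s2) * random.gauss(0, 1)     # step 2: sample (reparameterized)
    log_lik = (-0.5 * math.log(2 * math.pi * sigma_x2)
               - (x - f(z)) ** 2 / (2 * sigma_x2))  # log Pr(x|z, phi)
    kl = 0.5 * (s2 + mu * mu - 1 - math.log(s2))    # KL to N(0, 1), eq. 17.24 in 1D
    return log_lik - kl                             # step 3: equation 17.23

val = elbo_single_sample(1.3)
assert math.isfinite(val)
```

The negative of this quantity serves as the loss for one data point; a training loop would average it over a mini-batch.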
The associated architecture is shown in figure 17.9. It should now be clear why this
is called a variational autoencoder. It is variational because it computes a Gaussian
approximation to the posterior distribution. It is an autoencoder because it starts with
a data point x, computes a lower-dimensional latent vector z from this, and then uses this
vector to recreate the data point x as closely as possible. In this context, the mapping
from the data to the latent variable by the network g[x, θ] is called the encoder, and the
mapping from the latent variable to the data by the network f[z, ϕ] is called the decoder.
The VAE computes the ELBO as a function of both ϕ and θ. To maximize this
bound, we run mini-batches of samples through the network and update these parameters
with an optimization algorithm such as SGD or Adam. The gradients of the ELBO with
respect to the parameters are computed as usual using automatic differentiation. During
this process, we are both moving between the colored curves (changing θ) and along them
(changing ϕ) in figure 17.10. As training proceeds, the parameters ϕ change to assign the
data a higher likelihood in the nonlinear latent variable model.
Figure 17.10 The VAE updates both fac-
tors that determine the lower bound at
each iteration. Both the parameters ϕ of
the decoder and the parameters θ of the
encoder are manipulated to increase this
lower bound.
Figure 17.11 Reparameterization trick. With the original architecture (figure 17.9),
we cannot easily backpropagate through the sampling step. The reparameterization
trick removes the sampling step from the main pipeline; we draw
from a standard normal and combine this with the predicted mean and covariance
to get a sample from the variational distribution.
17.7 The reparameterization trick
There is one more complication; the network involves a sampling step, and it is difficult
to differentiate through this stochastic component. However, differentiating past this
step is necessary to update the parameters θ that precede it in the network.
Fortunately, there is a simple solution; we can move the stochastic part into a branch
of the network that draws a sample ϵ from Norm_ϵ[0, I] and then use the relation
(see Problem 17.5):
z = µ + Σ^{1/2} ϵ,          (17.25)
to draw from the intended Gaussian. Now we can compute the derivatives as usual
because the backpropagation algorithm does not need to pass down the stochastic branch.
This is known as the reparameterization trick (figure 17.11; see Notebook 17.2).
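A short sketch of the trick itself: samples formed as µ + σϵ, with ϵ drawn from a standard normal, have the intended mean and variance (the values of µ and σ below are hypothetical):

```python
import math, random

random.seed(3)

mu, sigma = 2.0, 0.7           # hypothetical predicted mean and standard deviation

# Reparameterization: draw eps ~ N(0, 1) in a side branch, then deterministically
# form z = mu + sigma * eps, which is distributed as N(mu, sigma^2).
N = 100_000
zs = [mu + sigma * random.gauss(0, 1) for _ in range(N)]
mean = sum(zs) / N
var = sum((z - mean) ** 2 for z in zs) / N
assert abs(mean - mu) < 0.02
assert abs(var - sigma ** 2) < 0.02
```

The gradient of z with respect to µ and σ is now an ordinary deterministic derivative; only ϵ is random, and it does not depend on the parameters.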
17.8 Applications
Variational autoencoders have many uses, including denoising, anomaly detection, and
compression. This section reviews several applications for image data.
17.8.1 Approximating sample probability
In section 17.3, we argued that it is not possible to evaluate the probability of a sample
with the VAE, which describes this probability as:
Pr(x) = ∫ Pr(x|z)Pr(z) dz
      = E_z[ Pr(x|z) ]
      = E_z[ Norm_x[ f[z, ϕ], σ²I ] ].          (17.26)
In principle, we could approximate this probability using equation 17.22 by drawing
samples from Pr(z) = Norm_z[0, I] and computing:
Pr(x) ≈ (1/N) Σ_{n=1}^{N} Pr(x|z_n).          (17.27)
However, the curse of dimensionality means that almost all values of z_n that we
draw would have a very low probability; we would have to draw an enormous number
of samples to get a reliable estimate. A better approach is to use importance sampling.
Here, we sample z from an auxiliary distribution q(z), evaluate Pr(x|z_n), and rescale
the resulting values by the probability q(z_n) under the new distribution:
Pr(x) = ∫ Pr(x|z)Pr(z) dz
      = ∫ [ Pr(x|z)Pr(z) / q(z) ] q(z) dz
      = E_{q(z)}[ Pr(x|z)Pr(z) / q(z) ]
      ≈ (1/N) Σ_{n=1}^{N} Pr(x|z_n)Pr(z_n) / q(z_n),          (17.28)
where now we draw the samples from q(z). If q(z) is close to the region of z where
Pr(x|z) has high likelihood, then we will focus the sampling on the relevant area of
space and estimate Pr(x) much more efficiently (see Notebook 17.3).
The product Pr(x|z)Pr(z) that we are trying to integrate is proportional to the
posterior distribution Pr(z|x) (by Bayes' rule). Hence, a sensible choice of auxiliary
distribution q(z) is the variational posterior q(z|x) computed by the encoder.
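A 1D sketch of the two estimators, using a toy model in which the evidence has a closed form (a Gaussian likelihood and Gaussian prior give Pr(x) = Norm_x[0, 1 + σ²]), so both estimates can be checked. The auxiliary distribution here is hand-picked near the posterior rather than produced by an encoder:

```python
import math, random

random.seed(4)

def norm_pdf(v, mean, var):
    return math.exp(-(v - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Toy model: Pr(x|z) = N(x; z, 0.25), Pr(z) = N(0, 1). The evidence is then
# Pr(x) = N(x; 0, 1.25) in closed form, which lets us verify the estimates.
x, s2 = 2.5, 0.25
exact = norm_pdf(x, 0.0, 1.0 + s2)

N = 50_000
# Naive Monte Carlo: sample z from the prior (equation 17.27).
naive = sum(norm_pdf(x, random.gauss(0, 1), s2) for _ in range(N)) / N

# Importance sampling (equation 17.28) with q(z) chosen near the posterior;
# (q_mu, q_var) are hand-picked here, standing in for the encoder output.
q_mu, q_var = 1.8, 0.3
total = 0.0
for _ in range(N):
    z = random.gauss(q_mu, math.sqrt(q_var))
    total += norm_pdf(x, z, s2) * norm_pdf(z, 0, 1) / norm_pdf(z, q_mu, q_var)
importance = total / N

assert abs(importance - exact) / exact < 0.02
assert abs(naive - exact) / exact < 0.2   # naive also works in 1D, just noisier
```

In high dimensions, the naive estimator degrades dramatically while the importance-sampled one remains usable if q(z) covers the posterior.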
Figure 17.12 Sampling from a standard VAE trained on CELEBA. In each column,
a latent variable z is drawn and passed through the model to predict the mean
f[z, ϕ] before adding independent Gaussian noise (see figure 17.3). a) A set of
samples that are the sum of b) the predicted means and c) spherical Gaussian
noise vectors. The images look too smooth before we add the noise and too
noisy afterward. This is typical, and usually, the noise-free version is shown since
the noise is considered to represent aspects of the image that are not modeled.
Adapted from Dorta et al. (2018). d) It is now possible to generate high-quality
images from VAEs using hierarchical priors, specialized architecture, and careful
regularization. Adapted from Vahdat & Kautz (2020).
In this way, we can approximate the probability of new samples. With sufficient
samples, this will provide a better estimate than the lower bound and could be used to
evaluate the quality of the model by evaluating the log-likelihood of test data. Alterna-
tively, it could be used as a criterion for determining whether new examples belong to
the distribution or are anomalous.
17.8.2 Generation
VAEs build a probabilistic model, and it's easy to sample from this model by drawing
from the prior Pr(z) over the latent variable, passing this result through the
decoder f[z, ϕ], and adding noise according to Pr(x|f[z, ϕ]). Unfortunately, samples from
vanilla VAEs are generally low-quality (figure 17.12a–c). This is partly because of the
naïve spherical Gaussian noise model and partly because of the Gaussian models used
for the prior and variational posterior. One trick to improve generation quality is to
sample from the aggregated posterior q(z|θ) = (1/I) Σ_i q(z|x_i, θ) rather than the prior;
this is the average posterior over all samples and is a mixture of Gaussians that is more
representative of the true distribution in latent space.
Modern VAEs can produce high-quality samples (figure 17.12d), but only by using
hierarchical priors and specialized network architecture and regularization techniques.
Diffusion models (chapter 18) can be viewed as VAEs with hierarchical priors. These
also create very high-quality samples.
17.8.3 Resynthesis
VAEs can also be used to modify real data. A data point x can be projected into the
latent space either by (i) taking the mean of the distribution predicted by the encoder
or (ii) using an optimization procedure to find the latent variable z that maximizes
the posterior probability, which Bayes' rule tells us is proportional to Pr(x|z)Pr(z).
In gure 17.13, multiple images labeled as “neutral” or “smiling” are projected into
latent space. The vector representing this change is estimated by taking the dierence
in latent space between the means of these two groups. A second vector is estimated to
represent “mouth closed” versus “mouth open.
Now the image of interest is projected into the latent space, and then the representation
is modified by adding or subtracting these vectors. To generate intermediate
images, spherical linear interpolation, or slerp, is used rather than linear interpolation
(see Problem 17.6). In 3D, this would be the difference between interpolating along the
surface of a sphere versus digging a straight tunnel through its body.
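A minimal slerp sketch (plain Python, with hypothetical 2D vectors) illustrates why the spherical path is preferred: it preserves the norm, whereas the linear midpoint cuts inside the sphere:

```python
import math

def slerp(a, b, t):
    """Spherical linear interpolation between vectors a and b, for t in [0, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    omega = math.acos(max(-1.0, min(1.0, dot / (na * nb))))  # angle between a and b
    so = math.sin(omega)
    wa = math.sin((1 - t) * omega) / so
    wb = math.sin(t * omega) / so
    return [wa * x + wb * y for x, y in zip(a, b)]

a, b = [1.0, 0.0], [0.0, 1.0]
mid = slerp(a, b, 0.5)
# The spherical midpoint of two unit vectors keeps unit length:
assert abs(math.hypot(*mid) - 1.0) < 1e-12
lerp_mid = [(x + y) / 2 for x, y in zip(a, b)]
assert math.hypot(*lerp_mid) < 1.0   # the straight "tunnel" falls inside the sphere
```

This matters because samples from a high-dimensional Gaussian prior concentrate near a spherical shell, so intermediate points on the straight line have atypically small norms.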
The process of encoding (and possibly modifying) input data before decoding again is
known as resynthesis. This can also be done with GANs and normalizing flows. However,
in GANs, there is no encoder, so a separate procedure must be used to nd the latent
variable that corresponds to the observed data.
17.8.4 Disentanglement
In the resynthesis example above, the directions in space representing interpretable prop-
erties had to be estimated using labeled training data. Other work attempts to improve
the characteristics of the latent space so that its coordinate directions correspond to real-
world properties. When each dimension represents an independent real-world factor, the
latent space is described as disentangled. For example, when modeling face images, we
might hope to uncover head pose or hair color as independent factors.
Methods to encourage disentanglement typically add regularization terms to the loss
function based on either (i) the posterior q(z|x, θ) over the latent variables z, or (ii) the
aggregated posterior q(z|θ) = (1/I) Σ_i q(z|x_i, θ):
L_new = ELBO[θ, ϕ] + λ_1 E_{Pr(x)}[ r_1[q(z|x, θ)] ] + λ_2 r_2[q(z|θ)].          (17.29)
Figure 17.13 Resynthesis. The original image on the left is projected into the la-
tent space using the encoder, and the mean of the predicted Gaussian is chosen to
represent the image. The center-left image in the grid is the reconstruction of the
input. The other images are reconstructions after manipulating the latent space
in directions representing smiling/neutral (horizontal) and mouth open/closed
(vertical). Adapted from White (2016).
Here the regularization term r_1[•] is a function of the posterior and is weighted by λ_1.
The term r_2[•] is a function of the aggregated posterior and is weighted by λ_2.
For example, the beta VAE upweights the second term in the ELBO (equation 17.18):
ELBO[θ, ϕ] ≈ log[Pr(x|z, ϕ)] - β · D_KL[ q(z|x, θ) || Pr(z) ],          (17.30)
where β > 1 determines how much more the deviation from the prior Pr(z) is weighted
relative to the reconstruction error. Since the prior is usually a multivariate normal with
a spherical covariance matrix, its dimensions are independent. Hence, up-weighting this
term encourages the posterior distributions to be less correlated. Another variant is the
total correlation VAE, which adds a term to decrease the total correlation between vari-
ables in the latent space (figure 17.14) and maximizes the mutual information between
a small subset of the latent variables and the observations.
17.9 Summary
The VAE is an architecture that helps to learn a nonlinear latent variable model over x.
This model can generate new examples by sampling from the latent variable, passing the
result through a deep network, and then adding independent Gaussian noise.
Figure 17.14 Disentanglement in the total correlation VAE. The VAE model is
modied so that the loss function encourages the total correlation of the latent
variables to be minimized and hence encourages disentanglement. When trained
on a dataset of images of chairs, several of the latent dimensions have clear real-
world interpretations, including a) rotation, b) overall size, and c) legs (swivel
chair versus normal). In each case, the central column depicts samples from the
model, and as we move left to right, we are subtracting or adding a coordinate
vector in latent space. Adapted from Chen et al. (2018d).
It is not possible to compute the likelihood of a data point in closed form, and
this poses problems for training with maximum likelihood. However, we can define a
lower bound on the likelihood and maximize this bound. Unfortunately, for the bound
to be tight, we need to compute the posterior probability of the latent variable given
the observed data, which is also intractable. The solution is to make a variational
approximation. This is a simpler distribution (usually a Gaussian) that approximates
the posterior and whose parameters are computed by a second encoder network.
To create high-quality samples from the VAE, it seems to be necessary to model the
latent space with more sophisticated probability distributions than the Gaussian prior
and posterior. One option is to use hierarchical priors (in which one latent variable
generates another). The next chapter discusses diusion models, which produce very
high-quality examples and can be viewed as hierarchical VAEs.
Notes
The VAE was originally introduced by Kingma & Welling (2014). A comprehensive introduction
to variational autoencoders can be found in Kingma et al. (2019).
Applications: The VAE and variants thereof have been applied to images (Kingma & Welling,
2014; Gregor et al., 2016; Gulrajani et al., 2016; Akuzawa et al., 2018), speech (Hsu et al., 2017b),
text (Bowman et al., 2015; Hu et al., 2017; Xu et al., 2020), molecules (Gómez-Bombarelli et al.,
2018; Sultan et al., 2018), graphs (Kipf & Welling, 2016; Simonovsky & Komodakis, 2018),
robotics (Hernández et al., 2018; Inoue et al., 2018; Park et al., 2018), reinforcement learning
(Heess et al., 2015; Van Hoof et al., 2016), 3D scenes (Eslami et al., 2016, 2018; Rezende Jimenez
et al., 2016), and handwriting (Chung et al., 2015).
Applications include resynthesis and interpolation (White, 2016; Bowman et al., 2015),
collaborative filtering (Liang et al., 2018), and compression (Gregor et al., 2016). Gómez-Bombarelli
et al. (2018) use the VAE to construct a continuous representation of chemical structures that
can then be optimized for desirable properties. Ravanbakhsh et al. (2017) simulate astronomical
observations for calibrating measurements.
Relation to other models: The autoencoder (Rumelhart et al., 1985; Hinton & Salakhutdi-
nov, 2006) passes data through an encoder to a bottleneck layer and then reconstructs it using
a decoder. The bottleneck is similar to latent variables in the VAE, but the motivation differs.
Here, the goal is not to learn a probability distribution but to create a low-dimensional repre-
sentation that captures the essence of the data. Autoencoders also have various applications,
including denoising (Vincent et al., 2008) and anomaly detection (Zong et al., 2018).
If the encoder and decoder are linear transformations, the autoencoder is just principal compo-
nent analysis (PCA). Hence, the nonlinear autoencoder is a generalization of PCA. There are
also probabilistic forms of PCA. Probabilistic PCA (Tipping & Bishop, 1999) adds spherical
Gaussian noise to the reconstruction to create a probability model, and factor analysis adds
diagonal Gaussian noise (see Rubin & Thayer, 1982). If we make the encoder and decoder of
these probabilistic variants nonlinear, we return to the variational autoencoder.
Architectural variations: The conditional VAE (Sohn et al., 2015) passes class information c
into both the encoder and decoder. The result is that the latent space does not need to encode
the class information. For example, when MNIST data are conditioned on the digit label, the
latent variables might encode the orientation and width of the digit rather than the digit category
itself. Sønderby et al. (2016a) introduced ladder variational autoencoders, which recursively
correct the generative distribution with a data-dependent approximate likelihood term.
Modifying likelihood: Other work investigates more sophisticated likelihood models Pr(x|z).
The PixelVAE (Gulrajani et al., 2016) used an autoregressive model over the output variables.
Dorta et al. (2018) modeled the covariance of the decoder output as well as the mean. Lamb
et al. (2016) improved the quality of reconstruction by adding extra regularization terms that
encourage the reconstruction to be similar to the original image in the space of activations
of a layer of an image classification model. This model encourages semantic information to
be retained and was used to generate the results in gure 17.13. Larsen et al. (2016) use an
adversarial loss for reconstruction, which also improves results.
Latent space, prior, and posterior: Many different forms for the variational approximation
to the posterior have been investigated, including normalizing flows (Rezende & Mohamed,
2015; Kingma et al., 2016), directed graphical models (Maaløe et al., 2016), undirected models
(Vahdat et al., 2020), and recursive models for temporal data (Gregor et al., 2016, 2019).
Other authors have investigated using a discrete latent space (Van Den Oord et al., 2017; Razavi
et al., 2019b; Rolfe, 2017; Vahdat et al., 2018a,b). For example, Razavi et al. (2019b) use a vector
quantized latent space and model the prior with an autoregressive model (equation 12.15). This
is slow to sample from but can describe very complex distributions.
Jiang et al. (2016) use a mixture of Gaussians for the posterior, allowing clustering. This is a
hierarchical latent variable model that adds a discrete latent variable to improve the flexibility
of the posterior. Other authors (Salimans et al., 2015; Ranganath et al., 2016; Maaløe et al.,
2016; Vahdat & Kautz, 2020) have experimented with hierarchical models that use continuous
variables. These have a close connection with diusion models (chapter 18).
Combination with other models: Gulrajani et al. (2016) combined VAEs with an autore-
gressive model to produce more realistic images. Chung et al. (2015) combine the VAE with
recurrent neural networks to model time-varying measurements.
As discussed above, adversarial losses have been used to inform the likelihood term directly.
However, other models have combined ideas from generative adversarial networks (GANs) with
VAEs in dierent ways. Makhzani et al. (2015) use an adversarial loss in the latent space;
the idea is that the discriminator will ensure that the aggregated posterior distribution q(z)
is indistinguishable from the prior distribution Pr(z). Tolstikhin et al. (2018) generalize this
to a broader family of distances between the prior and aggregated posterior. Dumoulin et al.
(2017) introduced adversarially learned inference which uses an adversarial loss to distinguish
two pairs of latent/observed data points. In one case, the latent variable is drawn from the
latent posterior distribution and, in the other, from the prior. Other hybrids of VAEs and
GANs were proposed by Larsen et al. (2016), Brock et al. (2016), and Hsu et al. (2017a).
Posterior collapse: One potential problem in training is posterior collapse, in which the
encoder always predicts the prior distribution. This was identified by Bowman et al. (2015) and
can be mitigated by gradually increasing the term that encourages the KL distance between the
posterior and the prior to be small during training. Several other methods have been proposed
to prevent posterior collapse (Razavi et al., 2019a; Lucas et al., 2019b,a), and this is also part
of the motivation for using a discrete latent space (Van Den Oord et al., 2017).
Blurry reconstructions: Zhao et al. (2017c) provide evidence that the blurry reconstructions
are partly due to Gaussian noise and also because of the sub-optimal posterior distributions
induced by the variational approximation. It is perhaps not coincidental that some of the
best synthesis results have come from using a discrete latent space modeled by a sophisticated
autoregressive model (Razavi et al., 2019b) or from using hierarchical latent spaces (Vahdat &
Kautz, 2020; see gure 17.12d). Figure 17.12a-c used a VAE that was trained on the CELEBA
database (Liu et al., 2015). Figure 17.12d uses a hierarchical VAE that was trained on the
CELEBA HQ dataset (Karras et al., 2018).
Other problems: Chen et al. (2017) noted that when more complex likelihood terms are used,
such as the PixelCNN (Van den Oord et al., 2016c), the output can cease to depend on the
latent variables at all. They term this the information preference problem. This was addressed
by Zhao et al. (2017b) in the InfoVAE, which added an extra term that maximized the mutual
information between the latent and observed distributions.
Another problem with the VAE is that there can be “holes” in the latent space that do not
correspond to any realistic sample. Xu et al. (2020) introduce the constrained posterior VAE,
which helps prevent these vacant regions in latent space by adding a regularization term. This
allows for better interpolation from real samples.
Disentangling latent representation: Methods to “disentangle” the latent representation
include the beta VAE (Higgins et al., 2017) and others (e.g., Kim & Mnih, 2018; Kumar et al.,
Figure 17.15 Expectation maximization
(EM) algorithm. The EM algorithm al-
ternately adjusts the auxiliary parame-
ters θ (moves between colored curves)
and model parameters ϕ (moves along
colored curves) until a maximum
is reached. These adjustments are
known as the E-step and the M-step,
respectively. Because the E-step uses
the posterior distribution Pr(h|x, ϕ)
for q(h|x, θ), the bound is tight, and
the colored curve touches the black
likelihood curve after each E-step.
2018). Chen et al. (2018d) further decomposed the ELBO to show the existence of a term
measuring the total correlation between the latent variables (i.e., the distance between the
aggregate posterior and the product of its marginals). They use this to motivate the total
correlation VAE, which attempts to minimize this quantity. The Factor VAE (Kim & Mnih,
2018) uses a dierent approach to minimize the total correlation. Mathieu et al. (2019) discuss
the factors that are important in disentangling representations.
Reparameterization trick: Consider computing an expectation of some function, where the
probability distribution with which the expectation is taken depends on some parameters. The
reparameterization trick computes the derivative of this expectation with respect to these pa-
rameters. This chapter introduced this as a method to differentiate through the sampling
procedure approximating the expectation; there are alternative approaches (see problem 17.5),
but the reparameterization trick gives an estimator that (usually) has low variance. This issue
is discussed in Rezende et al. (2014), Kingma et al. (2015), and Roeder et al. (2017).
Lower bound and the EM algorithm: VAE training is based on optimizing the evidence
lower bound (sometimes also referred to as the ELBO, variational lower bound, or negative
variational free energy). Homan & Johnson (2016) and Lücke et al. (2020) re-express this
lower bound in several ways that elucidate its properties. Other work has aimed to make
this bound tighter (Burda et al., 2016; Li & Turner, 2016; Bornschein et al., 2016; Masrani
et al., 2019). For example, Burda et al. (2016) use a modied bound based on using multiple
importance-weighted samples from the approximate posterior to form the objective function.
The ELBO is tight when the distribution q(z|θ) matches the posterior Pr(z|x, ϕ). This is
the basis of the expectation maximization (EM) algorithm (Dempster et al., 1977). Here, we
alternately (i) choose θ so that q(z|θ) equals the posterior Pr(z|x, ϕ) and (ii) change ϕ to
maximize the lower bound (figure 17.15; see Problem 17.7). This is viable for models like the
mixture of Gaussians, where we can compute the posterior distribution in closed form.
Unfortunately, this is not the case for the nonlinear latent variable model, so this method
cannot be used.
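A minimal numerical sketch of EM for a two-component 1D mixture of Gaussians (data and initialization are hypothetical), illustrating that each E-step/M-step pair never decreases the log-likelihood:

```python
import math, random

random.seed(5)

def norm_pdf(v, mean, var):
    return math.exp(-(v - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Data drawn from a hypothetical two-component 1D mixture.
data = ([random.gauss(-2, 1) for _ in range(200)]
        + [random.gauss(3, 1) for _ in range(200)])

lam = [0.5, 0.5]; mu = [-1.0, 1.0]; var = [1.0, 1.0]   # initial parameters

def log_lik():
    return sum(math.log(sum(l * norm_pdf(x, m, v)
                            for l, m, v in zip(lam, mu, var))) for x in data)

prev = log_lik()
for _ in range(30):
    # E-step: posterior responsibilities Pr(z|x) for each point (bound becomes tight).
    r = [[l * norm_pdf(x, m, v) for l, m, v in zip(lam, mu, var)] for x in data]
    r = [[ri / sum(row) for ri in row] for row in r]
    # M-step: closed-form parameter updates that maximize the bound.
    for k in range(2):
        nk = sum(row[k] for row in r)
        lam[k] = nk / len(data)
        mu[k] = sum(row[k] * x for row, x in zip(r, data)) / nk
        var[k] = sum(row[k] * (x - mu[k]) ** 2 for row, x in zip(r, data)) / nk
    cur = log_lik()
    assert cur >= prev - 1e-9   # EM never decreases the likelihood
    prev = cur
```

After a few iterations, the component means settle near the generating values, which is exactly the closed-form-posterior case that the nonlinear latent variable model lacks.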
Problems
Problem 17.1 How many parameters are needed to create a 1D mixture of Gaussians with n = 5
components (equation 17.4)? State the possible range of values that each parameter could take.
Problem 17.2 A function is concave if its second derivative is less than or equal to zero every-
where. Show that this is true for the function g[x] = log[x].
Problem 17.3 For convex functions, Jensen’s inequality works the other way around.
g[ E[y] ] ≤ E[ g[y] ].          (17.31)
A function is convex if its second derivative is greater than or equal to zero everywhere. Show
that the function g[x] = x^{2n} is convex for arbitrary n ∈ [1, 2, 3, . . .]. Use this result with Jensen's
inequality to show that the square of the mean E[x] of a distribution Pr(x) must be less than
or equal to its second moment E[x²].
Problem 17.4 Show that the ELBO, as expressed in equation 17.18, can alternatively be de-
rived from the KL divergence between the variational distribution q(z|x) and the true posterior
distribution Pr(z|x, ϕ):

D_KL[ q(z|x) || Pr(z|x, ϕ) ] = ∫ q(z|x) log[ q(z|x) / Pr(z|x, ϕ) ] dz.          (17.32)
Start by using Bayes’ rule (equation 17.19).
Problem 17.5 The reparameterization trick computes the derivative of an expectation of a
function f[x]:

∂/∂ϕ E_{Pr(x|ϕ)}[ f[x] ],          (17.33)
with respect to the parameters ϕ of the distribution Pr(x|ϕ) that the expectation is over. Show
that this derivative can also be computed as:
∂/∂ϕ E_{Pr(x|ϕ)}[ f[x] ] = E_{Pr(x|ϕ)}[ f[x] ∂/∂ϕ log[Pr(x|ϕ)] ]
                         ≈ (1/I) Σ_{i=1}^{I} f[x_i] ∂/∂ϕ log[Pr(x_i|ϕ)].          (17.34)
This method is known as the REINFORCE algorithm or score function estimator.
Problem 17.6 Why is it better to use spherical linear interpolation rather than regular linear
interpolation when moving between points in the latent space? Hint: consider figure 8.13.
Problem 17.7 Derive the EM algorithm for the 1D mixture of Gaussians model with N
components. To do this, you need to (i) find an expression for the posterior distribution Pr(z|x)
over the latent variable z ∈ {1, 2, . . . , N} for a data point x and (ii) find an expression that
updates the evidence lower bound given the posterior distributions for all of the data points.
You will need to use Lagrange multipliers to ensure that the weights λ_1, . . . , λ_N of the Gaussians
sum to one.
Chapter 18
Diusion models
Chapter 15 described generative adversarial models, which produce plausible-looking
samples but do not define a probability distribution over the data. Chapter 16 discussed
normalizing flows. These do define such a probability distribution but must place archi-
tectural constraints on the network; each layer must be invertible, and the determinant
of its Jacobian must be easy to calculate. Chapter 17 introduced variational autoen-
coders, which also have a solid probabilistic foundation but where the computation of
the likelihood is intractable and must be approximated by a lower bound.
This chapter introduces diffusion models. Like normalizing flows, these are probabilistic
models that define a nonlinear mapping from latent variables to the observed data
where both quantities have the same dimension. Like variational autoencoders, they
approximate the data likelihood using a lower bound based on an encoder that maps to
the latent variable. However, in diffusion models, this encoder is predetermined; the
goal is to learn a decoder that is the inverse of this process and can be used to produce
samples. Diffusion models are easy to train and can produce very high-quality samples
that exceed the realism of those produced by GANs. The reader should be familiar with
variational autoencoders (chapter 17) before reading this chapter.
18.1 Overview

A diffusion model consists of an encoder and a decoder. The encoder takes a data sample x and maps it through a series of intermediate latent variables z_1 ... z_T. The decoder reverses this process; it starts with z_T and maps back through z_{T−1}, ..., z_1 until it finally (re-)creates a data point x. In both encoder and decoder, the mappings are stochastic rather than deterministic.

The encoder is prespecified; it gradually blends the input with samples of white noise (figure 18.1). With enough steps, the conditional distribution q(z_T|x) and marginal distribution q(z_T) of the final latent variable both become the standard normal distribution. Since this process is prespecified, all the learned parameters are in the decoder.

Figure 18.1 Diffusion models. The encoder (forward, or diffusion process) maps the input x through a series of latent variables z_1 ... z_T. This process is prespecified and gradually mixes the data with noise until only noise remains. The decoder (reverse process) is learned and passes the data back through the latent variables, removing noise at each stage. After training, new examples are generated by sampling noise vectors z_T and passing them through the decoder.

In the decoder, a series of networks are trained to map backward between each adjacent pair of latent variables z_t and z_{t−1}. The loss function encourages each network to invert the corresponding encoder step. The result is that noise is gradually removed from the representation until a realistic-looking data example remains. To generate a new data example x, we draw a sample from q(z_T) and pass it through the decoder.

This work is subject to a Creative Commons CC-BY-NC-ND license. (C) MIT Press.
In section 18.2, we consider the encoder in detail. Its properties are non-obvious
but are critical for the learning algorithm. In section 18.3, we discuss the decoder.
Section 18.4 derives the training algorithm, and section 18.5 reformulates it to be more
practical. Section 18.6 discusses implementation details, including how to make the
generation conditional on text prompts.
18.2 Encoder (forward process)

The diffusion or forward process¹ (figure 18.2) maps a data example x through a series of intermediate variables z_1, z_2, ..., z_T with the same size as x according to:

$$\begin{aligned}
z_1 &= \sqrt{1-\beta_1}\cdot x + \sqrt{\beta_1}\cdot \epsilon_1 \\
z_t &= \sqrt{1-\beta_t}\cdot z_{t-1} + \sqrt{\beta_t}\cdot \epsilon_t \qquad \forall\, t \in 2,\ldots,T, \tag{18.1}
\end{aligned}$$

where ε_t is noise drawn from a standard normal distribution. The first term attenuates the data plus any noise added so far, and the second adds more noise. The hyperparameters β_t ∈ [0, 1] determine how quickly the noise is blended and are collectively known as the noise schedule. The forward process can equivalently be written as:

¹Note, this is the opposite nomenclature to normalizing flows, where the inverse mapping moves from the data to the latent variable, and the forward mapping moves back again.
Figure 18.2 Forward process. a) We consider one-dimensional data x with T = 100 latent variables z_1, ..., z_100 and β = 0.03 at all steps. Three values of x (gray, cyan, and orange) are initialized (top row). These are propagated through z_1, ..., z_100. At each step, the variable is updated by attenuating its value by √(1−β) and adding noise with mean zero and variance β (equation 18.1). Accordingly, the three examples noisily propagate through the variables with a tendency to move toward zero. b) The conditional probabilities Pr(z_1|x) and Pr(z_t|z_{t−1}) are normal distributions with a mean that is slightly closer to zero than the current point and a fixed variance β_t (equation 18.2).
$$\begin{aligned}
q(z_1|x) &= \text{Norm}_{z_1}\Bigl[\sqrt{1-\beta_1}\,x,\; \beta_1\mathbf{I}\Bigr] \\
q(z_t|z_{t-1}) &= \text{Norm}_{z_t}\Bigl[\sqrt{1-\beta_t}\,z_{t-1},\; \beta_t\mathbf{I}\Bigr] \qquad \forall\, t \in \{2,\ldots,T\}. \tag{18.2}
\end{aligned}$$

This is a Markov chain because the probability of z_t is determined entirely by the value of the immediately preceding variable z_{t−1}. With sufficient steps T, all traces of the original data are removed, and q(z_T|x) = q(z_T) becomes a standard normal distribution.²

Problem 18.1

The joint distribution of all of the latent variables z_1, z_2, ..., z_T given input x is:

$$q(z_{1\ldots T}|x) = q(z_1|x)\prod_{t=2}^{T} q(z_t|z_{t-1}). \tag{18.3}$$

²We use q(z_t|z_{t−1}) rather than Pr(z_t|z_{t−1}) to match the notation in the description of the VAE encoder in the previous chapter.
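As a minimal sketch (numpy, with hypothetical variable names), the forward process of equations 18.1–18.3 can be simulated directly:

```python
import numpy as np

def forward_process(x, betas, rng):
    """Simulate the diffusion (forward) process of equation 18.1.

    x     : (D,) data example
    betas : (T,) noise schedule beta_t, each in [0, 1]
    Returns the list of latents [z_1, ..., z_T].
    """
    z = x
    latents = []
    for beta_t in betas:
        eps = rng.standard_normal(x.shape)                  # epsilon_t ~ Norm[0, I]
        z = np.sqrt(1.0 - beta_t) * z + np.sqrt(beta_t) * eps
        latents.append(z)
    return latents

rng = np.random.default_rng(0)
x = rng.standard_normal(4)                                  # a toy 4-dimensional example
latents = forward_process(x, betas=np.full(100, 0.03), rng=rng)
```

With enough steps, the final latent z_T carries essentially no trace of x and is approximately a standard normal sample.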
Figure 18.3 Diusion kernel. a) The point x
= 2.0 is propagated through the
latent variables using equation 18.1 (ve paths shown in gray). The diusion
kernel q(z
t
|x
) is the probability distribution over variable z
t
given that we started
from x
. It can be computed in closed-form and is a normal distribution whose
mean moves toward zero and whose variance increases as t increases. Heatmap
shows q(z
t
|x
) for each variable. Cyan lines show ±2 standard deviations from
the mean. b) The diusion kernel q(z
t
|x
) is shown explicitly for t = 20, 40, 80. In
practice, the diusion kernel allows us to sample a latent variable z
t
corresponding
to a given x
without computing the intermediate variables z
1
, . . . , z
t1
. When t
becomes very large, the diusion kernel becomes a standard normal.
Figure 18.4 Marginal distributions. a) Given an initial density Pr(x) (top row), the diffusion process gradually blurs the distribution as it passes through the latent variables z_t and moves it toward a standard normal distribution. Each subsequent horizontal line of the heatmap represents a marginal distribution q(z_t). b) The top graph shows the initial distribution Pr(x). The other two graphs show the marginal distributions q(z_20) and q(z_60), respectively.
18.2.1 Diusion kernel q(z
t
|x)
To train the decoder to invert this process, we use multiple samples z
t
at time t for the
same example x. However, generating these sequentially using equation 18.1 is time-
consuming when t is large. Fortunately, there is a closed-form expression for q(z
t
|x),
which allows us to directly draw samples z
t
given initial data point x without computing
the intermediate variables z
1
. . . z
t1
. This is known as the diusion kernel (gure 18.3).
To derive an expression for q(z
t
|x), consider the rst two steps of the forward process:
z
1
=
p
1 β
1
· x +
p
β
1
· ϵ
1
z
2
=
p
1 β
2
· z
1
+
p
β
2
· ϵ
2
. (18.4)
Substituting the rst equation into the second, we get:
z
2
=
p
1 β
2
p
1 β
1
· x +
p
β
1
· ϵ
1
+
p
β
2
· ϵ
2
(18.5)
=
p
1 β
2
p
1 β
1
· x +
p
1 (1 β
1
) · ϵ
1
+
p
β
2
· ϵ
2
=
p
(1 β
2
)(1 β
1
) · x +
p
1 β
2
(1 β
2
)(1 β
1
) · ϵ
1
+
p
β
2
· ϵ
2
.
The last two terms are independent samples from mean-zero normal distributions with
variances 1 β
2
(1 β
2
)(1 β
1
) and β
2
, respectively. The mean of this sum is zero,
Problem 18.2
and its variance is the sum of the component variances (see problem 18.2), so:
z
2
=
p
(1 β
2
)(1 β
1
) · x +
p
1 (1 β
2
)(1 β
1
) · ϵ, (18.6)
where ϵ is also a sample from a standard normal distribution.
If we continue this process by substituting this equation into the expression for z
3
and so on, we can show that:
Problem 18.3
z
t
=
α
t
· x +
1 α
t
· ϵ, (18.7)
where α
t
=
Q
t
s=1
1 β
s
. We can equivalently write this in probabilistic form:
q(z
t
|x) = Norm
z
t
h
α
t
· x, (1 α
t
)I
i
. (18.8)
For any starting data point x, variable z
t
is normally distributed with a known mean
and variance. Consequently, if we don’t care about the history of the evolution through
the intermediate variables z
1
. . . z
t1
, it is easy to generate samples from q(z
t
|x).
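A quick numerical check of equation 18.8 (a sketch with toy values): for a fixed x, iterating equation 18.1 gives mean and variance recursions whose solutions are exactly √α_t·x and 1−α_t:

```python
import numpy as np

# Check that the diffusion kernel (equation 18.8) agrees with iterating
# equation 18.1: for a fixed x, the mean and variance of z_t obey
# mu_t = sqrt(1 - beta_t) * mu_{t-1} and v_t = (1 - beta_t) * v_{t-1} + beta_t,
# whose solutions are mu_t = sqrt(alpha_t) * x and v_t = 1 - alpha_t.
x = 2.0                                     # the starting point x* of figure 18.3
betas = np.full(100, 0.03)                  # toy noise schedule
alpha = np.cumprod(1.0 - betas)             # alpha_t = prod_s (1 - beta_s)

mu, v = x, 0.0
for t, beta_t in enumerate(betas):
    mu = np.sqrt(1.0 - beta_t) * mu         # mean after one more step
    v = (1.0 - beta_t) * v + beta_t         # variance after one more step
    assert np.isclose(mu, np.sqrt(alpha[t]) * x)
    assert np.isclose(v, 1.0 - alpha[t])
```

This is why a sample z_t can be drawn in one shot without computing z_1, ..., z_{t−1}.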
18.2.2 Marginal distributions q(z_t)

The marginal distribution q(z_t) is the probability of observing a value of z_t given the distribution of possible starting points x and the possible diffusion paths for each starting point (figure 18.4). It can be computed by considering the joint distribution q(x, z_{1...t}) and marginalizing over all the variables except z_t:

Appendix C.1.2 Marginalization

$$q(z_t) = \iint q(z_{1\ldots t}, x)\, dz_{1\ldots t-1}\, dx = \iint q(z_{1\ldots t}|x)Pr(x)\, dz_{1\ldots t-1}\, dx, \tag{18.9}$$

where q(z_{1...t}|x) was defined in equation 18.3.

However, since we now have an expression for the diffusion kernel q(z_t|x) that "skips" the intervening variables, we can equivalently write:

$$q(z_t) = \int q(z_t|x)Pr(x)\, dx. \tag{18.10}$$

Hence, if we repeatedly sample from the data distribution Pr(x) and superimpose the diffusion kernel q(z_t|x) on each sample, the result is the marginal distribution q(z_t) (figure 18.4). However, the marginal distribution cannot be written in closed form because we don't know the original data distribution Pr(x).

Notebook 18.1 Diffusion encoder
18.2.3 Conditional distribution q(z_{t−1}|z_t)

We defined the conditional probability q(z_t|z_{t−1}) as the mixing process (equation 18.2). To reverse this process, we apply Bayes' rule:

Appendix C.1.4 Bayes' rule

$$q(z_{t-1}|z_t) = \frac{q(z_t|z_{t-1})q(z_{t-1})}{q(z_t)}. \tag{18.11}$$

This is intractable since we cannot compute the marginal distribution q(z_{t−1}).

For this simple 1D example, it's possible to evaluate q(z_{t−1}|z_t) numerically (figure 18.5). In general, these distributions have a complex form, but in many cases, they are well approximated by a normal distribution. This is important because when we build the decoder, we will approximate the reverse process using a normal distribution.
18.2.4 Conditional diusion distribution q(z
t1
|z
t
, x)
There is one nal distribution related to the encoder to consider. We noted above that
we could not nd the conditional distribution q(z
t1
|z
t
) because we do not know the
marginal distribution q(z
t1
). However, if we know the starting variable x, then we
do know the distribution q(z
t1
|x) at the time before. This is just the diusion kernel
(gure 18.3), and it is normally distributed.
Hence, it is possible to compute the conditional diusion distribution q(z
t1
|z
t
, x)
in closed form (gure 18.6). This distribution is used to train the decoder. It is the
distribution over z
t1
when we know the current latent variable z
t
and the training
Draft: please send errata to udlbookmail@gmail.com.
354 18 Diusion models
Figure 18.5 Conditional distribution q(z
t1
|z
t
). a) The marginal densities q(z
t
)
with three points z
t
highlighted. b) The probability q(z
t1
|z
t
) (cyan curves) is
computed via Bayes’ rule and is proportional to q(z
t
|z
t1
)q(z
t1
). In general, it
is not normally distributed (top graph), although often the normal is a good ap-
proximation (bottom two graphs). The rst likelihood term q(z
t
|z
t1
) is normal
in z
t1
(equation 18.2) with a mean that is slightly further from zero than z
t
(brown curves). The second term is the marginal density q(z
t1
) (gray curves).
Figure 18.6 Conditional distribution q(z
t1
|z
t
, x). a) Diusion kernel for x
=
2.1 with three points z
t
highlighted. b) The probability q(z
t1
|z
t
, x
) is com-
puted via Bayes’ rule and is proportional to q(z
t
|z
t1
)q(z
t1
|x
). This is nor-
mally distributed and can be computed in closed form. The rst likelihood
term q(z
t
|z
t1
) is normal in z
t1
(equation 18.2) with a mean that is slightly
further from zero than z
t
(brown curves). The second term is the diusion ker-
nel q(z
t1
|x
) (gray curves).
This work is subject to a Creative Commons CC-BY-NC-ND license. (C) MIT Press.
18.3 Decoder model (reverse process) 355
data example x (which, of course, we do when training). To compute an expression
for q(z
t1
|z
t
, x) we start with Bayes’ rule:
$$\begin{aligned}
q(z_{t-1}|z_t, x) &= \frac{q(z_t|z_{t-1}, x)q(z_{t-1}|x)}{q(z_t|x)} \tag{18.12}\\
&\propto q(z_t|z_{t-1})q(z_{t-1}|x) \\
&= \text{Norm}_{z_t}\Bigl[\sqrt{1-\beta_t}\cdot z_{t-1},\; \beta_t\mathbf{I}\Bigr]\,\text{Norm}_{z_{t-1}}\Bigl[\sqrt{\alpha_{t-1}}\cdot x,\; (1-\alpha_{t-1})\mathbf{I}\Bigr] \\
&\propto \text{Norm}_{z_{t-1}}\left[\frac{1}{\sqrt{1-\beta_t}}z_t,\; \frac{\beta_t}{1-\beta_t}\mathbf{I}\right]\text{Norm}_{z_{t-1}}\Bigl[\sqrt{\alpha_{t-1}}\cdot x,\; (1-\alpha_{t-1})\mathbf{I}\Bigr],
\end{aligned}$$

where between the first two lines, we have used the fact that q(z_t|z_{t−1}, x) = q(z_t|z_{t−1}) because the diffusion process is Markov, and all information about z_t is captured by z_{t−1}. Between lines three and four, we use the Gaussian change of variables identity:

Appendix C.3.4 Gaussian change of variables

$$\text{Norm}_{v}[\mathbf{A}w, \mathbf{B}] \propto \text{Norm}_{w}\Bigl[(\mathbf{A}^T\mathbf{B}^{-1}\mathbf{A})^{-1}\mathbf{A}^T\mathbf{B}^{-1}v,\; (\mathbf{A}^T\mathbf{B}^{-1}\mathbf{A})^{-1}\Bigr], \tag{18.13}$$

to rewrite the first distribution in terms of z_{t−1}. We then use a second Gaussian identity:

Problems 18.4–18.5

$$\text{Norm}_{w}[a, \mathbf{A}]\cdot \text{Norm}_{w}[b, \mathbf{B}] \propto \text{Norm}_{w}\Bigl[(\mathbf{A}^{-1}+\mathbf{B}^{-1})^{-1}(\mathbf{A}^{-1}a + \mathbf{B}^{-1}b),\; (\mathbf{A}^{-1}+\mathbf{B}^{-1})^{-1}\Bigr], \tag{18.14}$$

to combine the two normal distributions in z_{t−1}, which gives:

Problem 18.6

$$q(z_{t-1}|z_t, x) = \text{Norm}_{z_{t-1}}\left[\frac{(1-\alpha_{t-1})}{1-\alpha_t}\sqrt{1-\beta_t}\,z_t + \frac{\sqrt{\alpha_{t-1}}\,\beta_t}{1-\alpha_t}x,\; \frac{\beta_t(1-\alpha_{t-1})}{1-\alpha_t}\mathbf{I}\right]. \tag{18.15}$$

Note that the constants of proportionality in equations 18.12, 18.13, and 18.14 must cancel out since the final result is already a correctly normalized probability distribution.
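Equation 18.15 can be checked numerically in 1D (a sketch with toy values): evaluating the product q(z_t|z_{t−1})q(z_{t−1}|x) on a dense grid and normalizing should reproduce the closed-form mean and variance:

```python
import numpy as np

# Numerically verify equation 18.15 in 1D (toy values): the closed-form mean
# and variance of q(z_{t-1} | z_t, x) should match what we get by evaluating
# q(z_t | z_{t-1}) q(z_{t-1} | x) on a dense grid and normalizing (Bayes' rule).
beta = np.full(20, 0.1)
alpha = np.cumprod(1.0 - beta)
t = 10                                        # 1-indexed time step
bt, at, at1 = beta[t - 1], alpha[t - 1], alpha[t - 2]
x, z_t = 1.5, 0.3

# Closed form (equation 18.15)
mean_cf = (1 - at1) / (1 - at) * np.sqrt(1 - bt) * z_t + np.sqrt(at1) * bt / (1 - at) * x
var_cf = bt * (1 - at1) / (1 - at)

# Grid-based computation
def norm_pdf(v, mu, var):
    return np.exp(-0.5 * (v - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

g = np.linspace(-6.0, 6.0, 200001)            # grid over z_{t-1}
dg = g[1] - g[0]
post = norm_pdf(z_t, np.sqrt(1 - bt) * g, bt) * norm_pdf(g, np.sqrt(at1) * x, 1 - at1)
post /= (post * dg).sum()                     # normalize numerically
mean_num = (g * post * dg).sum()
var_num = ((g - mean_num) ** 2 * post * dg).sum()
```

The grid estimate agrees with the closed form to well within numerical tolerance, confirming the Gaussian identities above.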
18.3 Decoder model (reverse process)

When we learn a diffusion model, we learn the reverse process. In other words, we learn a series of probabilistic mappings back from latent variable z_T to z_{T−1}, from z_{T−1} to z_{T−2}, and so on, until we reach the data x. The true reverse distributions q(z_{t−1}|z_t) of the diffusion process are complex multi-modal distributions (figure 18.5) that depend on the data distribution Pr(x). We approximate these as normal distributions:

$$\begin{aligned}
Pr(z_T) &= \text{Norm}_{z_T}[\mathbf{0}, \mathbf{I}] \\
Pr(z_{t-1}|z_t, \phi_t) &= \text{Norm}_{z_{t-1}}\Bigl[f_t[z_t, \phi_t],\; \sigma_t^2\mathbf{I}\Bigr] \\
Pr(x|z_1, \phi_1) &= \text{Norm}_{x}\Bigl[f_1[z_1, \phi_1],\; \sigma_1^2\mathbf{I}\Bigr], \tag{18.16}
\end{aligned}$$
where f_t[z_t, φ_t] is a neural network that computes the mean of the normal distribution in the estimated mapping from z_t to the preceding latent variable z_{t−1}. The terms {σ_t²} are predetermined. If the hyperparameters β_t in the diffusion process are close to zero (and the number of time steps T is large), then this normal approximation will be reasonable.

We generate new examples from Pr(x) using ancestral sampling. We start by drawing z_T from Pr(z_T). Then we sample z_{T−1} from Pr(z_{T−1}|z_T, φ_T), sample z_{T−2} from Pr(z_{T−2}|z_{T−1}, φ_{T−1}), and so on until we finally generate x from Pr(x|z_1, φ_1).
18.4 Training

The joint distribution of the observed variable x and the latent variables {z_t} is:

$$Pr(x, z_{1\ldots T}|\phi_{1\ldots T}) = Pr(x|z_1, \phi_1)\prod_{t=2}^{T} Pr(z_{t-1}|z_t, \phi_t)\cdot Pr(z_T). \tag{18.17}$$

The likelihood of the observed data Pr(x|φ_{1...T}) is found by marginalizing over the latent variables:

Appendix C.1.2 Marginalization

$$Pr(x|\phi_{1\ldots T}) = \int Pr(x, z_{1\ldots T}|\phi_{1\ldots T})\, dz_{1\ldots T}. \tag{18.18}$$

To train the model, we maximize the log-likelihood of the training data {x_i} with respect to the parameters φ:

$$\hat{\phi}_{1\ldots T} = \underset{\phi_{1\ldots T}}{\operatorname{argmax}}\left[\sum_{i=1}^{I}\log\Bigl[Pr(x_i|\phi_{1\ldots T})\Bigr]\right]. \tag{18.19}$$

We can't maximize this directly because the marginalization in equation 18.18 is intractable. Hence, we use Jensen's inequality to define a lower bound on the likelihood and optimize the parameters φ_{1...T} with respect to this bound exactly as we did for the VAE (see section 17.3.1).
18.4.1 Evidence lower bound (ELBO)

To derive the lower bound, we multiply and divide the log-likelihood by the encoder distribution q(z_{1...T}|x) and apply Jensen's inequality (see section 17.3.2):

$$\begin{aligned}
\log\bigl[Pr(x|\phi_{1\ldots T})\bigr] &= \log\left[\int Pr(x, z_{1\ldots T}|\phi_{1\ldots T})\, dz_{1\ldots T}\right] \\
&= \log\left[\int q(z_{1\ldots T}|x)\frac{Pr(x, z_{1\ldots T}|\phi_{1\ldots T})}{q(z_{1\ldots T}|x)}\, dz_{1\ldots T}\right] \\
&\geq \int q(z_{1\ldots T}|x)\log\left[\frac{Pr(x, z_{1\ldots T}|\phi_{1\ldots T})}{q(z_{1\ldots T}|x)}\right] dz_{1\ldots T}. \tag{18.20}
\end{aligned}$$

This gives us the evidence lower bound (ELBO):

$$\text{ELBO}\bigl[\phi_{1\ldots T}\bigr] = \int q(z_{1\ldots T}|x)\log\left[\frac{Pr(x, z_{1\ldots T}|\phi_{1\ldots T})}{q(z_{1\ldots T}|x)}\right] dz_{1\ldots T}. \tag{18.21}$$

In the VAE, the encoder q(z|x) approximates the posterior distribution over the latent variables to make the bound tight, and the decoder maximizes this bound (figure 17.10). In diffusion models, the decoder must do all the work since the encoder has no parameters. It makes the bound tighter by both (i) changing its parameters so that the static encoder does approximate the posterior Pr(z_{1...T}|x, φ_{1...T}) and (ii) optimizing its own parameters with respect to that bound (see figure 17.6).
18.4.2 Simplifying the ELBO

We now manipulate the log term from the ELBO into the final form that we will optimize. We first substitute in the definitions for the numerator and denominator from equations 18.17 and 18.3, respectively:

$$\begin{aligned}
\log\left[\frac{Pr(x, z_{1\ldots T}|\phi_{1\ldots T})}{q(z_{1\ldots T}|x)}\right] &= \log\left[\frac{Pr(x|z_1, \phi_1)\prod_{t=2}^{T}Pr(z_{t-1}|z_t, \phi_t)\cdot Pr(z_T)}{q(z_1|x)\prod_{t=2}^{T}q(z_t|z_{t-1})}\right] \tag{18.22}\\
&= \log\left[\frac{Pr(x|z_1, \phi_1)}{q(z_1|x)}\right] + \log\left[\frac{\prod_{t=2}^{T}Pr(z_{t-1}|z_t, \phi_t)}{\prod_{t=2}^{T}q(z_t|z_{t-1})}\right] + \log\Bigl[Pr(z_T)\Bigr].
\end{aligned}$$

Then we expand the denominator of the second term:

$$q(z_t|z_{t-1}) = q(z_t|z_{t-1}, x) = \frac{q(z_{t-1}|z_t, x)q(z_t|x)}{q(z_{t-1}|x)}, \tag{18.23}$$

where the first equality follows because all of the information about variable z_t is encompassed in z_{t−1}, so the extra conditioning on the data x is irrelevant. The second equality is a straightforward application of Bayes' rule.

Appendix C.1.4 Bayes' rule

Substituting in this result gives:

$$\begin{aligned}
\log\left[\frac{Pr(x, z_{1\ldots T}|\phi_{1\ldots T})}{q(z_{1\ldots T}|x)}\right] &= \log\left[\frac{Pr(x|z_1, \phi_1)}{q(z_1|x)}\right] + \log\left[\frac{\prod_{t=2}^{T}Pr(z_{t-1}|z_t, \phi_t)\cdot q(z_{t-1}|x)}{\prod_{t=2}^{T}q(z_{t-1}|z_t, x)\cdot q(z_t|x)}\right] + \log\Bigl[Pr(z_T)\Bigr] \\
&= \log\bigl[Pr(x|z_1, \phi_1)\bigr] + \log\left[\frac{\prod_{t=2}^{T}Pr(z_{t-1}|z_t, \phi_t)}{\prod_{t=2}^{T}q(z_{t-1}|z_t, x)}\right] + \log\left[\frac{Pr(z_T)}{q(z_T|x)}\right] \\
&\approx \log\bigl[Pr(x|z_1, \phi_1)\bigr] + \sum_{t=2}^{T}\log\left[\frac{Pr(z_{t-1}|z_t, \phi_t)}{q(z_{t-1}|z_t, x)}\right], \tag{18.24}
\end{aligned}$$

where all but two of the terms in the product of the ratios q(z_{t−1}|x)/q(z_t|x) cancel out between lines two and three, leaving only q(z_1|x) and q(z_T|x). The last term in the third line is approximately log[1] = 0 since the result of the forward process q(z_T|x) is a standard normal distribution, and so is equal to the prior Pr(z_T).
The simplied ELBO is hence:
ELBO
ϕ
1...T
(18.25)
=
Z
q(z
1...T
|x) log
P r(x, z
1...T
|ϕ
1...T
)
q
(
z
1...T
|
x
)
dz
1...T
Z
q(z
1...T
|x)
log [P r(x|z
1
, ϕ
1
)] +
T
X
t=2
log
P r(z
t1
|z
t
, ϕ
t
)
q(z
t1
|z
t
, x)
!
dz
1...T
= E
q (z
1
|x)
h
log [P r(x|z
1
, ϕ
1
)]
i
T
X
t=2
E
q( z
t
|x)
D
KL
h
q(z
t1
|z
t
, x)
P r(z
t1
|z
t
, ϕ
t
)
i
,
where we have marginalized over the irrelevant variables in q(z
1...T
|x) between lines two
Problem 18.7
Appendix C.5.1
KL divergence
and three and used the denition of KL divergence (see problem 18.7).
18.4.3 Analyzing the ELBO

The first probability term in the ELBO was defined in equation 18.16:

$$Pr(x|z_1, \phi_1) = \text{Norm}_{x}\Bigl[f_1[z_1, \phi_1],\; \sigma_1^2\mathbf{I}\Bigr], \tag{18.26}$$

and is equivalent to the reconstruction term in the VAE. The ELBO will be larger if the model prediction matches the observed data. As for the VAE, we will approximate the expectation over the log of this quantity using a Monte Carlo estimate (see equations 17.22–17.23), in which we estimate the expectation with a sample from q(z_1|x).

The KL divergence terms in the ELBO measure the distance between Pr(z_{t−1}|z_t, φ_t) and q(z_{t−1}|z_t, x), which were defined in equations 18.16 and 18.15, respectively:

$$\begin{aligned}
Pr(z_{t-1}|z_t, \phi_t) &= \text{Norm}_{z_{t-1}}\Bigl[f_t[z_t, \phi_t],\; \sigma_t^2\mathbf{I}\Bigr] \tag{18.27}\\
q(z_{t-1}|z_t, x) &= \text{Norm}_{z_{t-1}}\left[\frac{(1-\alpha_{t-1})}{1-\alpha_t}\sqrt{1-\beta_t}\,z_t + \frac{\sqrt{\alpha_{t-1}}\,\beta_t}{1-\alpha_t}x,\; \frac{\beta_t(1-\alpha_{t-1})}{1-\alpha_t}\mathbf{I}\right].
\end{aligned}$$

The KL divergence between two normal distributions has a closed-form expression. Moreover, many of the terms in this expression do not depend on φ (see problem 18.8), and the expression simplifies to the squared difference between the means plus a constant C:

Appendix C.5.4 KL divergence between normal distributions

Problem 18.8

$$D_{KL}\Bigl[q(z_{t-1}|z_t, x)\,\Bigl\|\, Pr(z_{t-1}|z_t, \phi_t)\Bigr] = \frac{1}{2\sigma_t^2}\left\|\frac{(1-\alpha_{t-1})}{1-\alpha_t}\sqrt{1-\beta_t}\,z_t + \frac{\sqrt{\alpha_{t-1}}\,\beta_t}{1-\alpha_t}x - f_t[z_t, \phi_t]\right\|^2 + C. \tag{18.28}$$
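As a sketch of why only the means matter here (hypothetical helper, 1D/diagonal case): when the two normal distributions share the same isotropic variance, the trace and log-determinant terms of the general closed-form KL divergence cancel, leaving exactly the squared difference of means over 2σ². In equation 18.28, where the variances differ, those variance-only terms are what the constant C collects:

```python
import numpy as np

def kl_normal_shared_var(mu_q, mu_p, var):
    """KL[ Norm(mu_q, var*I) || Norm(mu_p, var*I) ] for a shared isotropic variance.

    General closed form:
      0.5 * [ tr(S_p^{-1} S_q) + (mu_p - mu_q)^T S_p^{-1} (mu_p - mu_q) - D
              + log det S_p - log det S_q ].
    With equal variances the trace term equals D and the log-determinants cancel,
    leaving only the squared difference between the means.
    """
    mu_q, mu_p = np.asarray(mu_q, float), np.asarray(mu_p, float)
    d = mu_q.size
    general = 0.5 * (d + np.sum((mu_p - mu_q) ** 2) / var - d + 0.0)
    simplified = np.sum((mu_p - mu_q) ** 2) / (2 * var)
    return general, simplified

general, simplified = kl_normal_shared_var([0.4, -1.0], [0.1, 0.2], var=0.25)
```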
Figure 18.7 Fitted model. a) Individual samples can be generated by sampling from the standard normal distribution Pr(z_T) (bottom row) and then sampling z_{T−1} from Pr(z_{T−1}|z_T) = Norm_{z_{T−1}}[f_T[z_T, φ_T], σ_T² I] and so on until we reach x (five paths shown). The estimated marginal densities (heatmap) are the aggregation of these samples and are similar to the true marginal densities (figure 18.4). b) The estimated distribution Pr(z_{t−1}|z_t) (brown curve) is a reasonable approximation to the true posterior of the diffusion model q(z_{t−1}|z_t) (cyan curve) from figure 18.5. The marginal distributions Pr(z_t) and q(z_t) of the estimated and true models (dark blue and gray curves, respectively) are also similar.
18.4.4 Diusion loss function
To t the model, we maximize the ELBO with respect to the parameters ϕ
1...T
. We
recast this as a minimization by multiplying with minus one and approximating the
expectations with samples to give the loss function:
L[ϕ
1...T
] =
I
X
i=1
reconstruction term
z }| {
log
h
Norm
x
i
f
1
[z
i1
, ϕ
1
], σ
2
1
I
i
(18.29)
+
T
X
t=2
1
2σ
2
t
1 α
t1
1 α
t
p
1 β
t
z
it
+
α
t1
β
t
1 α
t
x
i
| {z }
target, mean of q(z
t1
|z
t
, x)
f
t
[z
it
, ϕ
t
]
| {z }
predicted z
t1
2
,
where x
i
is the i
th
data point, and z
it
is the associated latent variable at diusion step t.
Figure 18.8 Fitted model results. Cyan and brown curves are original and estimated densities and correspond to the top rows of figures 18.4 and 18.7, respectively. Vertical bars are binned samples from the model, generated by sampling from Pr(z_T) and propagating back through the variables z_{T−1}, z_{T−2}, ... as shown for the five paths in figure 18.7.
18.4.5 Training procedure

This loss function can be used to train a network for each diffusion time step. It minimizes the difference between the estimate f_t[z_t, φ_t] of the hidden variable at the previous time step and the most likely value that it took given the ground truth de-noised data x.

Figures 18.7 and 18.8 show the fitted reverse process for the simple 1D example. This model was trained by (i) taking a large dataset of examples x from the original density, (ii) using the diffusion kernel to predict many corresponding values for the latent variable z_t at each time t, and then (iii) training the models f_t[z_t, φ_t] to minimize the loss function in equation 18.29. These models were nonparametric (i.e., lookup tables relating 1D input to 1D output), but more typically, they would be deep neural networks.

Notebook 18.2 1D diffusion model
18.5 Reparameterization of loss function

Although the loss function in equation 18.29 can be used, diffusion models have been found to work better with a different parameterization; the loss function is modified so that the model aims to predict the noise that was mixed with the original data example to create the current variable. Section 18.5.1 discusses reparameterizing the target (first two terms in the second line of equation 18.29), and section 18.5.2 discusses reparameterizing the network (last term in the second line of equation 18.29).

18.5.1 Reparameterization of target

The original diffusion update was given by:

$$z_t = \sqrt{\alpha_t}\cdot x + \sqrt{1-\alpha_t}\cdot \epsilon. \tag{18.30}$$

It follows that the data term x in equation 18.28 can be expressed as the diffused image minus the noise that was added to it:

$$x = \frac{1}{\sqrt{\alpha_t}}\cdot z_t - \frac{\sqrt{1-\alpha_t}}{\sqrt{\alpha_t}}\cdot \epsilon. \tag{18.31}$$
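A quick numerical check of this inversion (numpy sketch with a toy schedule and an illustrative time index): a latent z_t built from x and ε via equation 18.30 can be inverted exactly for x using equation 18.31:

```python
import numpy as np

# Check equation 18.31 (toy schedule): given z_t built from x and noise eps via
# the diffusion kernel (equation 18.30), x is recovered exactly from z_t and eps.
rng = np.random.default_rng(1)
beta = np.full(50, 0.05)
alpha = np.cumprod(1.0 - beta)
t = 30                                        # an arbitrary time step (0-indexed here)
x = rng.standard_normal(3)
eps = rng.standard_normal(3)
z_t = np.sqrt(alpha[t]) * x + np.sqrt(1.0 - alpha[t]) * eps                          # eq 18.30
x_rec = z_t / np.sqrt(alpha[t]) - np.sqrt(1.0 - alpha[t]) / np.sqrt(alpha[t]) * eps  # eq 18.31
```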
Substituting this into the target terms from equation 18.29 gives:

$$\begin{aligned}
\frac{(1-\alpha_{t-1})}{1-\alpha_t}&\sqrt{1-\beta_t}\,z_t + \frac{\sqrt{\alpha_{t-1}}\,\beta_t}{1-\alpha_t}x \tag{18.32}\\
&= \frac{(1-\alpha_{t-1})}{1-\alpha_t}\sqrt{1-\beta_t}\,z_t + \frac{\sqrt{\alpha_{t-1}}\,\beta_t}{1-\alpha_t}\left(\frac{1}{\sqrt{\alpha_t}}z_t - \frac{\sqrt{1-\alpha_t}}{\sqrt{\alpha_t}}\epsilon\right) \\
&= \frac{(1-\alpha_{t-1})}{1-\alpha_t}\sqrt{1-\beta_t}\,z_t + \frac{\beta_t}{1-\alpha_t}\left(\frac{1}{\sqrt{1-\beta_t}}z_t - \frac{\sqrt{1-\alpha_t}}{\sqrt{1-\beta_t}}\epsilon\right),
\end{aligned}$$

where we have used the fact that $\sqrt{\alpha_t}/\sqrt{\alpha_{t-1}} = \sqrt{1-\beta_t}$ between the second and third lines. Simplifying further, we get:

Problem 18.9
$$\begin{aligned}
\frac{(1-\alpha_{t-1})}{1-\alpha_t}&\sqrt{1-\beta_t}\,z_t + \frac{\sqrt{\alpha_{t-1}}\,\beta_t}{1-\alpha_t}x \tag{18.33}\\
&= \left(\frac{(1-\alpha_{t-1})\sqrt{1-\beta_t}}{1-\alpha_t} + \frac{\beta_t}{(1-\alpha_t)\sqrt{1-\beta_t}}\right)z_t - \frac{\beta_t}{\sqrt{1-\alpha_t}\sqrt{1-\beta_t}}\epsilon \\
&= \left(\frac{(1-\alpha_{t-1})(1-\beta_t)}{(1-\alpha_t)\sqrt{1-\beta_t}} + \frac{\beta_t}{(1-\alpha_t)\sqrt{1-\beta_t}}\right)z_t - \frac{\beta_t}{\sqrt{1-\alpha_t}\sqrt{1-\beta_t}}\epsilon \\
&= \frac{(1-\alpha_{t-1})(1-\beta_t) + \beta_t}{(1-\alpha_t)\sqrt{1-\beta_t}}z_t - \frac{\beta_t}{\sqrt{1-\alpha_t}\sqrt{1-\beta_t}}\epsilon \\
&= \frac{1-\alpha_t}{(1-\alpha_t)\sqrt{1-\beta_t}}z_t - \frac{\beta_t}{\sqrt{1-\alpha_t}\sqrt{1-\beta_t}}\epsilon \\
&= \frac{1}{\sqrt{1-\beta_t}}z_t - \frac{\beta_t}{\sqrt{1-\alpha_t}\sqrt{1-\beta_t}}\epsilon,
\end{aligned}$$

where we have multiplied the numerator and denominator of the first term by $\sqrt{1-\beta_t}$ between lines two and three, multiplied out the terms, and simplified the numerator in the first term between lines three and four.

Problem 18.10
Substituting this back into the loss function (equation 18.29), we have:

$$L[\phi_{1\ldots T}] = \sum_{i=1}^{I}\Biggl[-\log\Bigl[\text{Norm}_{x_i}\bigl[f_1[z_{i1}, \phi_1], \sigma_1^2\mathbf{I}\bigr]\Bigr] + \sum_{t=2}^{T}\frac{1}{2\sigma_t^2}\Bigl\|\frac{1}{\sqrt{1-\beta_t}}z_{it} - \frac{\beta_t}{\sqrt{1-\alpha_t}\sqrt{1-\beta_t}}\epsilon_{it} - f_t[z_{it}, \phi_t]\Bigr\|^2\Biggr]. \tag{18.34}$$
18.5.2 Reparameterization of network

Now we replace the model ẑ_{t−1} = f_t[z_t, φ_t] with a new model ε̂ = g_t[z_t, φ_t], which predicts the noise ε that was mixed with x to create z_t:

$$f_t[z_t, \phi_t] = \frac{1}{\sqrt{1-\beta_t}}z_t - \frac{\beta_t}{\sqrt{1-\alpha_t}\sqrt{1-\beta_t}}g_t[z_t, \phi_t]. \tag{18.35}$$
Substituting the new model into equation 18.34 produces the criterion:

$$L[\phi_{1\ldots T}] = \sum_{i=1}^{I}\Biggl[-\log\Bigl[\text{Norm}_{x_i}\bigl[f_1[z_{i1}, \phi_1], \sigma_1^2\mathbf{I}\bigr]\Bigr] + \sum_{t=2}^{T}\frac{\beta_t^2}{(1-\alpha_t)(1-\beta_t)2\sigma_t^2}\Bigl\|g_t[z_{it}, \phi_t] - \epsilon_{it}\Bigr\|^2\Biggr]. \tag{18.36}$$
The log normal can be written as a least squares loss plus a constant C_i (section 5.3.1):

$$L[\phi_{1\ldots T}] = \sum_{i=1}^{I}\Biggl[\frac{1}{2\sigma_1^2}\Bigl\|x_i - f_1[z_{i1}, \phi_1]\Bigr\|^2 + \sum_{t=2}^{T}\frac{\beta_t^2}{(1-\alpha_t)(1-\beta_t)2\sigma_t^2}\Bigl\|g_t[z_{it}, \phi_t] - \epsilon_{it}\Bigr\|^2 + C_i\Biggr].$$
Substituting in the denitions of x and f
1
[z
1
, ϕ
1
] from equations 18.31 and 18.35, re-
Problem 18.11
spectively, the rst term simplies to:
1
2σ
2
1
x
i
f
1
[z
i1
, ϕ
1
]
2
=
1
2σ
2
1
β
1
1 α
1
1 β
1
g
1
[z
i1
, ϕ
1
]
β
1
1 α
1
1 β
1
ϵ
i1
2
.
(18.37)
Adding this back to the nal loss function yields:
L[ϕ
1...T
] =
I
X
i=1
T
X
t=1
β
2
t
(1 α
t
)(1 β
t
)2σ
2
t
g
t
[z
it
, ϕ
t
] ϵ
it
2
, (18.38)
where we have disregarded the additive constants C
i
.
In practice, the scaling factors (which might be dierent at each time step) are ig-
nored, giving an even simpler formulation:
L[ϕ
1...T
] =
I
X
i=1
T
X
t=1
g
t
[z
it
, ϕ
t
] ϵ
it
2
(18.39)
=
I
X
i=1
T
X
t=1
g
t
h
α
t
· x
i
+
1 α
t
· ϵ
it
, ϕ
t
i
ϵ
it
2
,
where we have rewritten z
t
using the diusion kernel (equation 18.30) in the second line.
18.6 Implementation

This leads to straightforward algorithms for both training the model (algorithm 18.1) and sampling (algorithm 18.2). The training algorithm has the advantages that it is (i) simple to implement and (ii) naturally augments the dataset; we can reuse every original data point x_i as many times as we want at each time step with different noise instantiations ε. The sampling algorithm has the disadvantage that it requires serial processing of many neural networks g_t[z_t, φ_t] and is hence time-consuming.

Notebook 18.3 Reparameterized model
Algorithm 18.1: Diusion model training
Input: Training data x
Output: Model parameters
ϕ
t
repeat
for i B do // For every training example index in batch
t Uniform[1, . . . T ] // Sample random timestep
ϵ Norm[0, I] // Sample noise
i
=
g
t
h
α
t
x
i
+
1 α
t
ϵ, ϕ
t
i
ϵ
2
// Compute individual loss
Accumulate losses for batch and take gradient step
until converged
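Algorithm 18.1 can be sketched end-to-end in 1D (a hypothetical toy setup, not the book's notebook): the "network" g_t[z, φ_t] is a single scalar per time step, trained by gradient descent on the simplified loss of equation 18.39:

```python
import numpy as np

# Toy 1D instance of Algorithm 18.1. The noise-prediction "network" is just
# one scalar phi[t] per time step, with g_t[z, phi_t] = phi[t] * z; this is a
# hypothetical stand-in for a deep network.
rng = np.random.default_rng(0)
T, lr = 20, 0.05
beta = np.full(T, 0.1)
alpha = np.cumprod(1.0 - beta)
phi = np.zeros(T)                                  # one parameter per time step

losses = []
for step in range(300):
    x = rng.standard_normal(64)                    # batch of toy training examples
    t = rng.integers(0, T, size=64)                # random time step per example
    eps = rng.standard_normal(64)                  # sample noise
    z = np.sqrt(alpha[t]) * x + np.sqrt(1 - alpha[t]) * eps   # diffusion kernel
    err = phi[t] * z - eps                         # g_t[z, phi_t] - eps
    losses.append(np.mean(err ** 2))
    grad, count = np.zeros(T), np.zeros(T)         # per-timestep gradient of loss
    np.add.at(grad, t, 2 * err * z)
    np.add.at(count, t, 1)
    phi -= lr * grad / np.maximum(count, 1)        # gradient step
```

The loss falls from about E[ε²] = 1 toward its optimum; for standard-normal toy data the learned φ[t] approaches √(1−α_t).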
Algorithm 18.2: Sampling

Input: Model, g_t[·, ϕ_t]
Output: Sample, x

z_T ~ Norm_z[0, I]                                                        // Sample last latent variable
for t = T ... 2 do
    ẑ_{t−1} = (1/√(1−β_t)) · z_t − (β_t/(√(1−α_t)·√(1−β_t))) · g_t[z_t, ϕ_t]   // Predict previous latent variable
    ϵ ~ Norm_ϵ[0, I]                                                      // Draw new noise vector
    z_{t−1} = ẑ_{t−1} + σ_t ϵ                                             // Add noise to previous latent variable
x = (1/√(1−β_1)) · z_1 − (β_1/(√(1−α_1)·√(1−β_1))) · g_1[z_1, ϕ_1]        // Generate sample from z_1 without noise
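Algorithm 18.2 in the same 1D toy setting (hypothetical stand-in model): for data that is itself standard normal, the optimal noise prediction is g_t[z, φ_t] = √(1−α_t)·z, so the sampler should return roughly standard-normal values:

```python
import numpy as np

# Toy 1D instance of Algorithm 18.2 with a hypothetical stand-in predictor.
rng = np.random.default_rng(0)
T = 100
beta = np.full(T, 0.03)
alpha = np.cumprod(1.0 - beta)
sigma = np.sqrt(beta)                          # one common choice for sigma_t

def g(z, t):                                   # stand-in noise-prediction model
    return np.sqrt(1.0 - alpha[t]) * z

def sample(rng):
    z = rng.standard_normal()                  # z_T ~ Norm[0, 1]
    for t in range(T - 1, 0, -1):              # steps T ... 2 (0-indexed t = T-1 ... 1)
        z_hat = z / np.sqrt(1 - beta[t]) \
            - beta[t] / (np.sqrt(1 - alpha[t]) * np.sqrt(1 - beta[t])) * g(z, t)
        z = z_hat + sigma[t] * rng.standard_normal()   # add new noise
    # Final step: generate x from z_1 without adding noise
    return z / np.sqrt(1 - beta[0]) \
        - beta[0] / (np.sqrt(1 - alpha[0]) * np.sqrt(1 - beta[0])) * g(z, 0)

samples = np.array([sample(rng) for _ in range(500)])
```

The serial loop over all T steps is what makes diffusion sampling slow relative to other generative models.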
18.6.1 Application to images

Diffusion models have been very successful in modeling image data. Here, we need to construct models that can take a noisy image and predict the noise that was added at each step. The obvious architectural choice for this image-to-image mapping is the U-Net (figure 11.10). However, there may be a very large number of diffusion steps, and training and storing multiple U-Nets is inefficient. The solution is to train a single U-Net that also takes a predetermined vector representing the time step as input (figure 18.9). In practice, this is resized to match the number of channels at each stage of the U-Net and used to offset and/or scale the representation at each spatial position.

A large number of time steps are needed because the conditional probabilities q(z_{t−1}|z_t) become closer to normal when the hyperparameters β_t are close to zero, matching the form of the decoder distributions Pr(z_{t−1}|z_t, φ_t). However, this makes sampling slow. We might have to run the U-Net model through T = 1000 steps to generate good images.
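The sinusoidal time embedding used to condition the single U-Net can be sketched as follows (the dimension and frequency base are illustrative choices, not the book's exact values):

```python
import numpy as np

def time_embedding(t, dim=64, base=10000.0):
    """Sinusoidal embedding of an integer time step (cf. figure 12.5).

    The dimension and frequency base are illustrative. In a diffusion U-Net,
    this vector would be passed through a shallow network and added to (or
    used to scale) the channels at every spatial position.
    """
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)   # geometric ladder of frequencies
    angles = t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

emb = time_embedding(250)                       # embedding for time step t = 250
```

Because each time step maps to a distinct, smoothly varying vector, one set of U-Net weights can behave differently at every diffusion step.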
Figure 18.9 U-Net as used in diffusion models for images. The network aims to predict the noise that was added to the image. It consists of an encoder which reduces the scale and increases the number of channels and a decoder which increases the scale and reduces the number of channels. The encoder representations are concatenated to their partner in the decoder. Connections between adjacent representations consist of residual blocks, and periodic global self-attention in which every spatial position interacts with every other spatial position. A single network is used for all time steps, by passing a sinusoidal time embedding (figure 12.5) through a shallow neural network and adding the result to the channels at every spatial position at every stage of the U-Net.

18.6.2 Improving generation speed

The loss function (equation 18.39) requires the diffusion kernel to have the form q(z_t|x) = Norm[√α_t·x, (1−α_t)·I]. The same loss function will be valid for any forward process with this relation, and there is a family of such compatible processes. These are all optimized by the same loss function but have different rules for the forward process and different corresponding rules for how to use the estimated noise g[z_t, φ_t] to predict z_{t−1} from z_t in the reverse process (figure 18.10).

Among this family are denoising diffusion implicit models, which are no longer stochastic after the first step from x to z_1, and accelerated sampling models, where the forward process is defined only on a sub-sequence of time steps. This allows a reverse process that skips time steps and hence makes sampling much more efficient; good samples can be created with 50 time steps when the forward process is no longer stochastic. This is much faster than before but still slower than most other generative models.

Notebook 18.4 Families of diffusion models
18.6.3 Conditional generation
If the data has associated labels c, these can be exploited to control the generation. Sometimes this can improve generation results in GANs, and we might expect this to be the case in diffusion models as well; it's easier to denoise an image if you have some information about what that image contains. One approach to conditional synthesis in diffusion models is classifier guidance. This modifies the denoising update from z_t to z_{t−1} to take into account class information c. In practice, this means adding an extra term into the final update step in algorithm 18.2 to yield:

$$z_{t-1} = \hat{z}_{t-1} + \sigma_t^2 \frac{\partial \log Pr(c|z_t)}{\partial z_t} + \sigma_t \epsilon. \tag{18.40}$$

Figure 18.10 Different diffusion processes that are compatible with the same model. a) Five sampled trajectories of the reparameterized model superimposed on the ground truth marginal distributions. Top row represents Pr(x) and subsequent rows represent q(x_t). b) Histogram of samples generated from the reparameterized model plotted alongside the ground truth density curve Pr(x). The same trained model is compatible with a family of diffusion models (and corresponding updates in the opposite direction), including the denoising diffusion implicit (DDIM) model, which is deterministic and does not add noise at each step. c) Five trajectories from the DDIM model. d) Histogram of samples from the DDIM model. The same model is also compatible with accelerated diffusion models that skip inference steps for increased sampling speed. e) Five trajectories from the accelerated model. f) Histogram of samples from the accelerated model.
The new term depends on the gradient of a classifier Pr(c|z_t) that is based on the latent variable z_t. This maps features from the downsampling half of the U-Net to the class c. Like the U-Net, it is usually shared across all time steps and takes the time step as an input. The update from z_t to z_{t−1} now makes the class c more likely.
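As a concrete sketch, one classifier-guided step might be implemented as below. This is illustrative, not the book's code: `grad_log_classifier` stands in for the gradient ∂log Pr(c|z_t)/∂z_t, which in practice would be obtained by backpropagating through the trained classifier.

```python
import numpy as np

def classifier_guided_step(z_hat_prev, sigma_t, grad_log_classifier, rng):
    """One classifier-guided denoising update (equation 18.40).

    z_hat_prev          -- the usual denoised estimate of z_{t-1}
    sigma_t             -- noise standard deviation at step t
    grad_log_classifier -- gradient of log Pr(c|z_t) with respect to z_t
    """
    eps = rng.standard_normal(z_hat_prev.shape)
    return z_hat_prev + sigma_t**2 * grad_log_classifier + sigma_t * eps

# Example call with placeholder values (sigma_t = 0 disables both terms):
rng = np.random.default_rng(0)
z_next = classifier_guided_step(np.zeros(3), 0.0, np.ones(3), rng)
```

Note that the guidance term is scaled by σ_t², so the class information has the most influence at steps where the noise level is high.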
Classier-free guidance avoids learning a separate classier P r(c|z
t
) but instead in-
corporates class information into the main model g
t
[z
t
, ϕ
t
, c]. In practice, this usually
takes the form of adding an embedding based on c to the layers of the U-Net in a similar
way to how the time step is added (see gure 18.9). This model is jointly trained on
conditional and unconditional objectives by randomly dropping the class information
Draft: please send errata to udlbookmail@gmail.com.
366 18 Diusion models
Figure 18.11 Cascaded conditional generation based on a text prompt. a) A diu-
sion model consisting of a series of U-Nets is used to generate a 64×64 image. b)
This generation is conditioned on a sentence embedding computed by a language
model. c) A higher resolution 256×256 image is generated and conditioned on the
smaller image and the text encoding. d) This is repeated to create a 1024×1024
image. e) Final image sequence. Adapted from Saharia et al. (2022b).
during training. Hence, it can both generate unconditional or conditional data examples
at test time or any weighted combination of the two. This brings a surprising advantage;
Problem 18.12
if the conditioning information is over-weighted, the model tends to produce very high
quality but slightly stereotypical examples. This is somewhat analogous to the use of
truncation in GANs (gure 15.10).
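The weighted combination at sampling time can be sketched as a simple blend of the two noise estimates; the function and weight names below are illustrative placeholders, not from the book:

```python
def classifier_free_combination(g_uncond, g_cond, lam):
    """Blend the unconditional and conditional noise estimates.
    lam = 0 gives the unconditional model, lam = 1 the conditional one,
    and lam > 1 over-weights the class information, trading diversity
    for more stereotypical, higher-quality samples."""
    return (1.0 - lam) * g_uncond + lam * g_cond

blended = classifier_free_combination(0.0, 1.0, 1.5)
```

With `lam` above one, the combination extrapolates past the conditional estimate, away from the unconditional one, which is what produces the over-weighting effect described above.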
18.6.4 Improving generation quality
As with other generative models, the highest-quality results come from applying a combination of tricks and extensions to the basic model. First, it has been noted that it helps to estimate the variances σ_t² of the reverse process as well as the means (i.e., the widths of the brown normal distributions in figure 18.7). This particularly improves the results when sampling with fewer steps. Second, it's possible to modify the noise schedule in the forward process so that β_t varies at each step, and this can also improve results. Third, to generate high-resolution images, a cascade of diffusion models is used. The first creates a low-resolution image (possibly guided by class information). The subsequent diffusion models generate progressively higher-resolution images. They condition on the lower-resolution image by resizing it and appending it to the layers of the constituent U-Net, as well as on any other class information (figure 18.11).
Combining all of these techniques allows the generation of very high-quality images. Figure 18.12 shows examples of images generated from a model conditioned on the ImageNet class. It is particularly impressive that the same model can learn to generate such diverse classes. Figure 18.13 shows images generated from a model that is trained to condition on text captions encoded by a language model like BERT, which are inserted into the model in the same way as the time step (figures 18.9 and 18.11). This results in very realistic images that agree with the caption. Since the diffusion model is stochastic by nature, it's possible to generate multiple images conditioned on the same caption.
18.7 Summary
Diusion models map the data examples through a series of latent variables by repeat-
edly blending the current representation with random noise. After sucient steps, the
representation becomes indistinguishable from white noise. Since these steps are small,
the reverse denoising process at each step can be approximated with a normal distribu-
tion and predicted by a deep learning model. The loss function is based on the evidence
lower bound (ELBO) and ultimately results in a simple least-squares formulation.
For image generation, each denoising step is implemented using a U-Net, so sampling
is slow compared to other generative models. To improve generation speed, it’s possible
to change the diusion model to a deterministic formulation, and here sampling with
fewer steps works well. Several methods have been proposed to condition generation
on class information, images, and text information. Combining these methods produces
impressive text-to-image synthesis results.
Notes
Denoising diusion models were introduced by Sohl-Dickstein et al. (2015), and early related
work based on score-matching was carried out by Song & Ermon (2019). Ho et al. (2020)
produced image samples that were competitive with GANs and kick-started a wave of interest
in this area. Most of the exposition in this chapter, including the original formulation and
the reparameterization, is derived from this paper. Dhariwal & Nichol (2021) improved the
quality of these results and showed for the rst time that images from diusion models were
quantitatively superior to GAN models in terms of Fréchet Inception Distance. At the time
Draft: please send errata to udlbookmail@gmail.com.
368 18 Diusion models
Figure 18.12 Conditional generation using classier guidance. Image samples
conditioned on dierent ImageNet classes. The same model produces high quality
samples of highly varied image classes. Adapted from Dhariwal & Nichol (2021).
Figure 18.13 Conditional generation using text prompts. Synthesized images
from a cascaded generation framework, conditioned on a text prompt encoded by
a large language model. The stochastic model can produce many dierent images
compatible with the prompt. The model can count objects and incorporate text
into images. Adapted from Saharia et al. (2022b).
This work is subject to a Creative Commons CC-BY-NC-ND license. (C) MIT Press.
Notes 369
of writing, the state-of-the-art results for conditional image synthesis have been achieved by
Karras et al. (2022). Surveys of denoising diusion models can be found in Croitoru et al.
(2022), Cao et al. (2022), Luo (2022), and Yang et al. (2022).
Applications for images: Applications of diffusion models include text-to-image generation (Nichol et al., 2022; Ramesh et al., 2022; Saharia et al., 2022b), image-to-image tasks such as colorization, inpainting, uncropping, and restoration (Saharia et al., 2022a), super-resolution (Saharia et al., 2022c), image editing (Hertz et al., 2022; Meng et al., 2021), removing adversarial perturbations (Nie et al., 2022), semantic segmentation (Baranchuk et al., 2022), and medical imaging (Song et al., 2021b; Chung & Ye, 2022; Chung et al., 2022; Peng et al., 2022; Xie & Li, 2022; Luo et al., 2022), where the diffusion model is sometimes used as a prior.
Dierent data types: Diusion models have also been applied to video data (Ho et al., 2022b;
Harvey et al., 2022; Yang et al., 2022; Höppe et al., 2022; Voleti et al., 2022) for generation, past
and future frame prediction, and interpolation. They have been used for 3D shape generation
(Zhou et al., 2021; Luo & Hu, 2021), and recently a technique has been introduced to generate
3D models using only a 2D text-to-image diusion model (Poole et al., 2023). Austin et al.
(2021) and Hoogeboom et al. (2021) investigated diusion models for discrete data. Kong et al.
(2021) and Chen et al. (2021d) applied diusion models to audio data.
Alternatives to denoising: The diffusion models in this chapter mix noise with the data and build a model to gradually denoise the result. However, degrading the image using noise is not necessary. Rissanen et al. (2022) devised a method that progressively blurred the image, and Bansal et al. (2022) showed that the same ideas work with a large family of degradations that do not have to be stochastic. These include masking, morphing, blurring, and pixelating.
Comparison to other generative models: Diffusion models synthesize higher-quality images than other generative models and are simple to train. They can be thought of as a special case of a hierarchical VAE (Vahdat & Kautz, 2020; Sønderby et al., 2016b) where the encoder is fixed, and the latent space is the same size as the data. They are probabilistic, but in their basic form, they can only compute a lower bound on the likelihood of a data point. However, Kingma et al. (2021) show that this lower bound improves on the exact log-likelihoods for test data from normalizing flows and autoregressive models. The likelihood for diffusion models can be computed by converting to an ordinary differential equation (Song et al., 2021c) or by training a continuous normalizing flow model with a diffusion-based criterion (Lipman et al., 2022). The main disadvantages of diffusion models are that they are slow and that the latent space has no semantic interpretation.
Improving quality: Many techniques have been proposed to improve image quality. These include the reparameterization of the network described in section 18.5 and the equal weighting of the subsequent terms (Ho et al., 2020). Choi et al. (2022) subsequently investigated different weightings of terms in the loss function.

Kingma et al. (2021) improved the test log-likelihood of the model by learning the denoising weights β_t. Conversely, Nichol & Dhariwal (2021) improved performance by learning separate variances σ² of the denoising estimate at each time step in addition to the mean. Bao et al. (2022) show how to learn the variances after training the model.
Ho et al. (2022a) developed the cascaded method for producing very high-resolution images (figure 18.11). To prevent artifacts in lower-resolution images from being propagated to higher resolutions, they introduced noise conditioning augmentation; here, the lower-resolution conditioning image is degraded by adding noise at each training step. This reduces the reliance on the exact details of the lower-resolution image during training. It is also done during inference, where the best noise level is chosen by sweeping over different values.
Improving speed: One of the major drawbacks of diffusion models is that they take a long time to train and sample from. Stable diffusion (Rombach et al., 2022) projects the original data to a smaller latent space using a conventional autoencoder and then runs the diffusion process in this smaller space. This has the advantages of reducing the dimensionality of the training data for the diffusion process and allowing other data types (text, graphs, etc.) to be described by diffusion models. Vahdat et al. (2021) applied a similar approach.
Song et al. (2021a) showed that an entire family of diffusion processes is compatible with the training objective. Most of these processes are non-Markovian (i.e., the diffusion step does not only depend on the results of the previous step). One of these models is the denoising diffusion implicit model (DDIM), in which the updates are not stochastic (figure 18.10b). This model is amenable to taking larger steps without inducing large errors. It effectively converts the model into an ordinary differential equation (ODE) in which the trajectories have low curvature and allows efficient numerical methods for solving ODEs to be applied.

Song et al. (2021c) propose converting the underlying stochastic differential equations into a probability flow ODE which has the same marginal distributions as the original process. Vahdat et al. (2021), Xiao et al. (2022b), and Karras et al. (2022) all exploit techniques for solving ODEs to speed up synthesis. Karras et al. (2022) identified the best-performing time discretization for sampling and evaluated different sampler schedules. The result of these and other improvements has been a significant drop in the steps required during synthesis.
Sampling is slow because many small diffusion steps are required to ensure that the posterior distribution q(z_{t−1}|z_t) is close to Gaussian (figure 18.5), so the Gaussian distribution in the decoder is appropriate. If we use a model that describes a more complex distribution at each denoising step, then we can use fewer diffusion steps in the first place. To this end, Xiao et al. (2022b) have investigated using conditional GAN models, and Gao et al. (2021) investigated using conditional energy-based models. Although these models cannot describe the original data distribution, they suffice to predict the (much simpler) reverse diffusion step.
Salimans & Ho (2022) distilled adjacent steps of the denoising process into a single step to speed up synthesis. Dockhorn et al. (2022) introduced momentum into the diffusion process. This makes the trajectories smoother and so more amenable to coarse sampling.
Conditional generation: Dhariwal & Nichol (2021) introduced classifier guidance, in which a classifier learns to identify the category of object being synthesized at each step, and this is used to bias the denoising update toward that class. This works well, but training a separate classifier is expensive. Classifier-free guidance (Ho & Salimans, 2022) concurrently trains conditional and unconditional denoising models by dropping the class information some proportion of the time in a process akin to dropout. This technique allows control of the relative contributions of the conditional and unconditional components. Over-weighting the conditional component causes the model to produce more typical and realistic samples.

The standard technique for conditioning on images is to append the (resized) image to the different layers of the U-Net. For example, this was used in the cascaded generation process for super-resolution (Ho et al., 2022a). Choi et al. (2021) provide a method for conditioning on images in an unconditional diffusion model by matching the latent variables with those of a conditioning image. The standard technique for conditioning on text is to linearly transform the text embedding to the same size as the U-Net layer and then add it to the representation in the same way that the time embedding is introduced (figure 18.9).

Existing diffusion models can also be fine-tuned to be conditioned on edge maps, joint positions, segmentation, depth maps, etc., using a neural network structure called a control network (Zhang & Agrawala, 2023).
Text-to-image: Before diffusion models, state-of-the-art text-to-image systems were based on transformers (e.g., Ramesh et al., 2021). GLIDE (Nichol et al., 2022) and DALL·E 2 (Ramesh et al., 2022) are both conditioned on embeddings from the CLIP model (Radford et al., 2021), which generates joint embeddings for text and image data. Imagen (Saharia et al., 2022b) showed that text embeddings from a large language model could produce even better results (see figure 18.13). The same authors introduced a benchmark (DrawBench) which is designed to evaluate the ability of a model to render colors, numbers of objects, spatial relations, and other characteristics. Feng et al. (2022) developed a Chinese text-to-image model.
Connections to other models: This chapter described diffusion models as hierarchical variational autoencoders because this approach connects most closely with the other parts of this book. However, diffusion models also have close connections with stochastic differential equations (consider the paths in figure 18.5) and with score matching (Song & Ermon, 2019, 2020). Song et al. (2021c) presented a framework based on stochastic differential equations that encompasses both the denoising and score-matching interpretations. Diffusion models also have close connections to normalizing flows (Zhang & Chen, 2021). Yang et al. (2022) present an overview of the relationship between diffusion models and other generative approaches.
Problems
Problem 18.1 Show that if Var[x_{t−1}] = I and we use the update:

$$x_t = \sqrt{1-\beta_t}\cdot x_{t-1} + \sqrt{\beta_t}\cdot \epsilon_t, \tag{18.41}$$

then Var[x_t] = I, so the variance stays the same.
Problem 18.2 Consider the variable:

$$z = a\cdot\epsilon_1 + b\cdot\epsilon_2, \tag{18.42}$$

where both ϵ_1 and ϵ_2 are drawn from independent standard normal distributions with mean zero and unit variance. Show that:

$$\mathbb{E}[z] = 0 \qquad \mathrm{Var}[z] = a^2 + b^2, \tag{18.43}$$

so we could equivalently compute z = √(a² + b²)·ϵ where ϵ is also drawn from a standard normal distribution.
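A quick Monte-Carlo check of this identity (a numerical sanity check, not a proof):

```python
import numpy as np

# Sample z = a*eps1 + b*eps2 many times and confirm that the empirical
# mean and variance match the claimed values 0 and a^2 + b^2.
rng = np.random.default_rng(1)
a, b = 0.8, 0.6
eps1 = rng.standard_normal(100_000)
eps2 = rng.standard_normal(100_000)
z = a * eps1 + b * eps2
mean_err = abs(z.mean())                  # should be near 0
var_err = abs(z.var() - (a**2 + b**2))    # should be near 0
```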
Problem 18.3 Continue the process in equation 18.5 to show that:

$$z_3 = \sqrt{(1-\beta_3)(1-\beta_2)(1-\beta_1)}\cdot x + \sqrt{1-(1-\beta_3)(1-\beta_2)(1-\beta_1)}\cdot \epsilon', \tag{18.44}$$

where ϵ′ is a draw from a standard normal distribution.
Problem 18.4 Prove the relation:

$$\text{Norm}_{v}[Aw, B] \propto \text{Norm}_{w}\!\left[(A^TB^{-1}A)^{-1}A^TB^{-1}v,\; (A^TB^{-1}A)^{-1}\right]. \tag{18.45}$$
Problem 18.5 Prove the relation:

$$\text{Norm}_{x}[a, A]\,\text{Norm}_{x}[b, B] \propto \text{Norm}_{x}\!\left[(A^{-1}+B^{-1})^{-1}(A^{-1}a + B^{-1}b),\; (A^{-1}+B^{-1})^{-1}\right]. \tag{18.46}$$
Problem 18.6 Derive equation 18.15.
Problem 18.7 Derive the third line of equation 18.25 from the second line.
Problem 18.8 The KL-divergence between two normal distributions in D dimensions with means a and b and covariance matrices A and B is given by:

$$D_{KL}\Big[\text{Norm}_{w}[a, A]\,\Big\|\,\text{Norm}_{w}[b, B]\Big] = \frac{1}{2}\left(\text{tr}\!\left[B^{-1}A\right] - D + (a-b)^T B^{-1}(a-b) + \log\frac{|B|}{|A|}\right). \tag{18.47}$$

Substitute the definitions from equation 18.27 into this expression and show that the only term that depends on the parameters ϕ is the first term from equation 18.28.
Problem 18.9 If $\alpha_t = \prod_{s=1}^{t}(1-\beta_s)$, then show that:

$$\sqrt{\frac{\alpha_t}{\alpha_{t-1}}} = \sqrt{1-\beta_t}. \tag{18.48}$$
Problem 18.10 If $\alpha_t = \prod_{s=1}^{t}(1-\beta_s)$, then show that:

$$\frac{(1-\alpha_{t-1})(1-\beta_t) + \beta_t}{(1-\alpha_t)\sqrt{1-\beta_t}} = \frac{1}{\sqrt{1-\beta_t}}. \tag{18.49}$$
Problem 18.11 Prove equation 18.37.
Problem 18.12 Classier-free guidance allows us to create more stereotyped “canonical” images
of a given class. When we described transformer decoders, generative adversarial networks, and
the GLOW algorithm, we also discussed methods to reduce the amount of variation and produce
more stereotyped outputs. What were these? Do you think it’s inevitable that we should limit
the output of generative models in this way?
Chapter 19
Reinforcement learning
Reinforcement learning (RL) is a sequential decision-making framework in which agents learn to perform actions in an environment with the goal of maximizing received rewards. For example, an RL algorithm might control the moves (actions) of a character (the agent) in a video game (the environment), aiming to maximize the score (the reward). In robotics, an RL algorithm might control the movements (actions) of a robot (the agent) in the world (the environment) to perform a task (earning a reward). In finance, an RL algorithm might control a virtual trader (the agent) who buys or sells assets (the actions) on a trading platform (the environment) to maximize profit (the reward).
Consider learning to play chess. Here, there is a reward of +1, −1, or 0 at the end of the game if the agent wins, loses, or draws, and 0 at every other time step. This illustrates the challenges of RL. First, the reward is sparse; here, we must play an entire game to receive feedback. Second, the reward is temporally offset from the action that caused it; a decisive advantage might be gained thirty moves before victory, and we must associate the reward with this critical action. This is termed the temporal credit assignment problem. Third, the environment is stochastic; the opponent doesn't always make the same move in the same situation, so it's hard to know if an action was truly good or just lucky. Finally, the agent must balance exploring the environment (e.g., trying new opening moves) with exploiting what it already knows (e.g., sticking to a previously successful opening). This is termed the exploration-exploitation trade-off.
Reinforcement learning is an overarching framework that does not necessarily require deep learning. However, in practice, state-of-the-art systems often use deep networks. They encode the environment (the video game display, robot sensors, financial time series, or chessboard) and map this directly or indirectly to the next action (figure 1.13).
19.1 Markov decision processes, returns, and policies
Reinforcement learning maps observations of an environment to actions, aiming to maximize a numerical quantity that is connected to the rewards received. In the most common case, we learn a policy that maximizes the expected return in a Markov decision process. This section explains these terms.
Figure 19.1 Markov process. A Markov process consists of a set of states and transition probabilities Pr(s_{t+1}|s_t) that define the probability of moving to state s_{t+1} given that the current state is s_t. a) The penguin can visit 16 different positions (states) on the ice. b) The ice is slippery, so at each time, it has an equal probability of moving to any adjacent state. For example, in position 6, it has a 25% chance of moving to states 2, 5, 7, and 10. A trajectory τ = [s_1, s_2, s_3, . . .] from this process consists of a sequence of states.
19.1.1 Markov process
A Markov process assumes that the world is always in one of a set of possible states. The word Markov implies that the probability of being in a state depends only on the previous state and not on the states before that. The changes between states are captured by the transition probabilities Pr(s_{t+1}|s_t) of moving to the next state s_{t+1} given the current state s_t, where t indexes the time step. Hence, a Markov process is an evolving system that produces a sequence s_1, s_2, s_3, . . . of states (figure 19.1).
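Sampling a trajectory only requires repeatedly drawing the next state from the transition probabilities; a minimal sketch with a small, made-up two-state chain (not the 16-state penguin example from the figures):

```python
import numpy as np

def sample_trajectory(P, s0, T, rng):
    """Sample a trajectory [s_1, ..., s_T] from a Markov process with
    transition matrix P, where P[s, s2] = Pr(s_{t+1} = s2 | s_t = s)."""
    states = [s0]
    for _ in range(T - 1):
        states.append(int(rng.choice(len(P), p=P[states[-1]])))
    return states

P = np.array([[0.9, 0.1],   # illustrative two-state transition matrix
              [0.5, 0.5]])
tau = sample_trajectory(P, 0, 10, np.random.default_rng(0))
```

The Markov property is visible in the code: the next state is drawn from a distribution that depends only on `states[-1]`, never on earlier history.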
19.1.2 Markov reward process
A Markov reward process extends the Markov process to include a distribution Pr(r_{t+1}|s_t) over the possible rewards r_{t+1} received at the next time step, given that we are in state s_t. This produces a sequence s_1, r_2, s_2, r_3, s_3, r_4, . . . of states and the associated rewards (figure 19.2). (Problem 19.1) The Markov reward process also includes a discount factor γ ∈ (0, 1] that is used to compute the return G_t at time t:

$$G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}. \tag{19.1}$$

The return is the sum of the discounted future rewards; it measures the future benefit of being on this trajectory. A discount factor of less than one makes rewards that are closer in time more valuable than rewards that are further away.
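Since G_t = r_{t+1} + γ·G_{t+1}, equation 19.1 can be computed with a simple backward accumulation over a finite list of future rewards; a small sketch:

```python
def discounted_return(future_rewards, gamma):
    """G_t = sum_k gamma^k * r_{t+k+1} (equation 19.1), given the list
    of future rewards [r_{t+1}, r_{t+2}, ...]."""
    g = 0.0
    for r in reversed(future_rewards):
        g = r + gamma * g   # G_t = r_{t+1} + gamma * G_{t+1}
    return g

g = discounted_return([0.0, 0.0, 1.0], 0.9)
```

With γ = 0.9 as in figure 19.2, a single reward of 1 received three steps in the future contributes 0.9² = 0.81 to the return, illustrating how the discount factor down-weights distant rewards.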
Figure 19.2 Markov reward process. This associates a distribution Pr(r_{t+1}|s_t) of rewards r_{t+1} with each state s_t. a) Here, the rewards are deterministic; the penguin will receive a reward of +1 if it lands on a fish and 0 otherwise. The trajectory τ now consists of a sequence s_1, r_2, s_2, r_3, s_3, r_4, . . . of alternating states and rewards, terminating after eight steps. The return G_t of the sequence is the sum of discounted future rewards, here with discount factor γ = 0.9. b-c) As the penguin proceeds along the trajectory and gets closer to reaching the rewards, the return increases.
Figure 19.3 Markov decision process. a) The agent (penguin) can perform one of a set of actions in each state. The action influences both the probability of moving to the successor state and the probability of receiving rewards. b) Here, the four actions correspond to moving up, right, down, and left. c) For any state (here, state 6), the action changes the probability of moving to the next state. The penguin moves in the intended direction with 50% probability, but the ice is slippery, so it may slide to one of the other adjacent positions with equal probability. Accordingly, in panel (a), the action taken (gray arrows) doesn't always line up with the trajectory (orange line). Here, the action does not affect the reward, so Pr(r_{t+1}|s_t, a_t) = Pr(r_{t+1}|s_t). The trajectory τ from an MDP consists of a sequence s_1, a_1, r_2, s_2, a_2, r_3, s_3, a_3, r_4, . . . of alternating states s_t, actions a_t, and rewards r_{t+1}. Note that here the penguin receives the reward when it leaves a state with a fish (i.e., the reward is received for passing through the fish square, regardless of whether the penguin arrived there intentionally or not).
Figure 19.4 Partially observable Markov decision process (POMDP). In a POMDP, the agent does not have access to the entire state. Here, the penguin is in state three and can only see the region in the dashed box. This is indistinguishable from what it would see in state nine. In the first case, moving right leads to the hole in the ice (with −2 reward) and, in the latter, to the fish (with +3 reward).
Figure 19.5 Policies. a) A deterministic policy always chooses the same action in
each state (indicated by arrow). Some policies are better than others. This policy
is not optimal but still generally steers the penguin from top-left to bottom-right
where the reward lies. b) This policy is more random. c) A stochastic policy has
a probability distribution over actions for each state (probability indicated by
size of arrows). This has the advantage that the agent explores the states more
thoroughly and can be necessary for optimal performance in partially observable
Markov decision processes.
Figure 19.6 Reinforcement learning loop. The agent takes an action a_t at time t based on the state s_t, according to the policy π[a_t|s_t]. This triggers the generation of a new state s_{t+1} (via the state transition function) and a reward r_{t+1} (via the reward function). Both are passed back to the agent, which then chooses a new action.
19.1.3 Markov decision process
A Markov decision process or MDP adds a set of possible actions at each time step. The action a_t changes the transition probabilities, which are now written as Pr(s_{t+1}|s_t, a_t). The rewards can also depend on the action and are now written as Pr(r_{t+1}|s_t, a_t). An MDP produces a sequence s_1, a_1, r_2, s_2, a_2, r_3, s_3, a_3, r_4, . . . of states, actions, and rewards (figure 19.3). The entity that performs the actions is known as the agent.
19.1.4 Partially observable Markov decision process
In a partially observable Markov decision process or POMDP, the state is not directly visible (figure 19.4). Instead, the agent receives an observation o_t drawn from Pr(o_t|s_t). Hence, a POMDP generates a sequence s_1, o_1, a_1, r_2, s_2, o_2, a_2, r_3, s_3, o_3, a_3, r_4, . . . of states, observations, actions, and rewards. In general, each observation will be more compatible with some states than others but insufficient to identify the state uniquely.
19.1.5 Policy
The rules that determine the agent's action for each state are known as the policy (figure 19.5). This may be stochastic (the policy defines a distribution over actions for each state) or deterministic (the agent always takes the same action in a given state). A stochastic policy π[a|s] returns a probability distribution over each possible action a for state s, from which a new action is sampled. A deterministic policy π[a|s] returns one for the action a that is chosen for state s and zero otherwise. A stationary policy depends only on the current state. A non-stationary policy also depends on the time step.
The environment and the agent form a loop (figure 19.6). The agent receives the state s_t and reward r_t from the last time step. Based on this, it can modify the policy π[a_t|s_t] if desired and choose the next action a_t. The environment then assigns the next state according to Pr(s_{t+1}|s_t, a_t) and the reward according to Pr(r_{t+1}|s_t, a_t). (Notebook 19.1: Markov decision processes)
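This loop can be sketched generically as follows; `policy`, `transition`, and `reward` are placeholder functions standing in for π[a|s], Pr(s_{t+1}|s_t, a_t), and a deterministic reward, and the tiny two-state setup at the bottom is illustrative rather than taken from the book:

```python
import numpy as np

def run_episode(policy, transition, reward, s0, T, rng):
    """Simulate T steps of the agent-environment loop (figure 19.6),
    returning a trajectory of (state, action, reward) triples."""
    s, trajectory = s0, []
    for _ in range(T):
        probs = policy(s)
        a = int(rng.choice(len(probs), p=probs))      # sample a_t ~ pi[a|s_t]
        r = reward(s, a)                              # reward r_{t+1}
        trajectory.append((s, a, r))
        # sample s_{t+1} ~ Pr(s_{t+1} | s_t, a_t)
        s = int(rng.choice(len(transition(s, a)), p=transition(s, a)))
    return trajectory

# Illustrative two-state environment: the states alternate, and state 1
# yields a reward of 1.
policy = lambda s: [1.0, 0.0]                         # always take action 0
transition = lambda s, a: [0.0, 1.0] if s == 0 else [1.0, 0.0]
reward = lambda s, a: 1.0 if s == 1 else 0.0
traj = run_episode(policy, transition, reward, 0, 4, np.random.default_rng(0))
```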
19.2 Expected return
The previous section introduced the Markov decision process and the idea of an agent carrying out actions according to a policy. We want to choose a policy that maximizes the expected return. In this section, we make this idea mathematically precise. To do that, we assign a value to each state s_t and each state-action pair {s_t, a_t}.
19.2.1 State and action values
The return G_t depends on the state s_t and the policy π[a|s]. From this state, the agent will pass through a sequence of states, taking actions and receiving rewards. This sequence differs every time the agent starts in the same place since, in general, the policy π[a_t|s_t], the state transitions Pr(s_{t+1}|s_t, a_t), and the rewards issued Pr(r_{t+1}|s_t, a_t) are all stochastic.

Figure 19.7 State and action values. a) The value v[s_t|π] of a state s_t (number at each position) is the expected return for this state for a given policy π (gray arrows). It is the average sum of discounted rewards received over many trajectories started from this state. Here, states closer to the fish are more valuable. b) The value q[s_t, a_t|π] of an action a_t in state s_t (four numbers at each position/state corresponding to the four actions) is the expected return given that this particular action is taken in this state. In this case, it gets larger as we get closer to the fish and is larger for actions that head in the direction of the fish. c) If we know the action values at a state, then the policy can be modified so that it chooses the maximum of these values (red numbers in panel b).

We can characterize how "good" a state is under a given policy π by considering the expected return v[s_t|π]. (Appendix C.2: Expectation) This is the return that would be received on average from sequences that start from this state and is termed the state value or state-value function (figure 19.7a):

$$v[s_t|\pi] = \mathbb{E}\Big[G_t \,\Big|\, s_t, \pi\Big]. \tag{19.2}$$
Informally, the state value tells us the long-term reward we can expect on average if we start in this state and follow the specified policy thereafter. It is highest for states where it's probable that subsequent transitions will bring large rewards soon (assuming the discount factor γ is less than one).
Similarly, the action value or state-action value function q[s_t, a_t|π] is the expected return from executing action a_t in state s_t (figure 19.7b):

$$q[s_t, a_t|\pi] = \mathbb{E}\Big[G_t \,\Big|\, s_t, a_t, \pi\Big]. \tag{19.3}$$

The action value tells us the long-term reward we can expect on average if we start in this state, take this action, and follow the specified policy thereafter. Through this quantity, reinforcement learning algorithms connect future rewards to current actions (i.e., resolve the temporal credit assignment problem).
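The expectations in equations 19.2 and 19.3 can be approximated by averaging returns over many sampled trajectories. A minimal Monte-Carlo sketch, where `sample_return` is a placeholder that would, in a real environment, roll out one trajectory from state s under the policy and return its discounted return G_t:

```python
import numpy as np

def mc_state_value(sample_return, s, n, rng):
    """Estimate v[s|pi] = E[G_t | s_t = s, pi] (equation 19.2) by
    averaging the returns of n trajectories started from state s."""
    return sum(sample_return(s, rng) for _ in range(n)) / n

# Stand-in environment: pretend the return from state s is s plus
# unit-variance noise, so the true state value is s itself.
fake_return = lambda s, rng: s + rng.standard_normal()
v_est = mc_state_value(fake_return, 2.0, 20_000, np.random.default_rng(0))
```

The same averaging applied to trajectories that begin with a fixed first action a_t would estimate the action value q[s_t, a_t|π] of equation 19.3.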
19.2.2 Optimal policy
We want a policy that maximizes the expected return. For MDPs (but not POMDPs), there is always a deterministic, stationary policy that maximizes the value of every state. If we know this optimal policy, then we get the optimal state-value function v*[s_t]:

$$v^*[s_t] = \max_{\pi}\; \mathbb{E}\Big[G_t \,\Big|\, s_t, \pi\Big]. \tag{19.4}$$
Similarly, the optimal state-action value function is obtained under the optimal policy:

$$q^*[s_t, a_t] = \max_{\pi}\Big[\mathbb{E}\Big[G_t \,\Big|\, s_t, a_t, \pi\Big]\Big]. \tag{19.5}$$
Turning this on its head, if we knew the optimal action values q*[s_t, a_t], then we can derive the optimal policy by choosing the action a_t with the highest value (figure 19.7c):¹

$$\pi[a_t|s_t] \leftarrow \underset{a_t}{\mathrm{argmax}}\Big[q^*[s_t, a_t]\Big]. \tag{19.6}$$

Indeed, some reinforcement learning algorithms are based on alternately estimating the action values and the policy (see section 19.3).
19.2.3 Bellman equations
We may not know the state values $v[s_t]$ or action values $q[s_t, a_t]$ for any policy.² However, we know that they must be consistent with one another, and it's easy to write relations between these quantities. The state value $v[s_t]$ can be found by taking a weighted sum of the action values $q[s_t, a_t]$, where the weights depend on the probability under the policy $\pi[a_t \mid s_t]$ of taking that action (figure 19.8):

$$v[s_t] = \sum_{a_t} \pi[a_t \mid s_t]\, q[s_t, a_t]. \qquad (19.7)$$
Similarly, the value of an action is the immediate reward $r_{t+1} = r[s_t, a_t]$ generated by taking the action, plus the value $v[s_{t+1}]$ of being in the subsequent state $s_{t+1}$ discounted by $\gamma$ (figure 19.9).³ Since the assignment of $s_{t+1}$ is not deterministic, we weight the values $v[s_{t+1}]$ according to the transition probabilities $\Pr(s_{t+1} \mid s_t, a_t)$:

$$q[s_t, a_t] = r[s_t, a_t] + \gamma \cdot \sum_{s_{t+1}} \Pr(s_{t+1} \mid s_t, a_t)\, v[s_{t+1}]. \qquad (19.8)$$
Substituting equation 19.8 into equation 19.7 provides a relation between the state
value at time t and t + 1:
¹The notation $\pi[a_t \mid s_t] \leftarrow a$ in equations 19.6, 19.12, and 19.13 means set $\pi[a_t \mid s]$ to one for action $a$ and $\pi[a_t \mid s]$ to zero for other actions.

²For simplicity, we will just write $v[s_t]$ and $q[s_t, a_t]$ instead of $v[s_t \mid \pi]$ and $q[s_t, a_t \mid \pi]$ from now on.

³We also assume from now on that the rewards are deterministic and can be written as $r[s_t, a_t]$.
Draft: please send errata to udlbookmail@gmail.com.
Figure 19.8 Relationship between state values and action values. The value of state six $v[s_t{=}6]$ is a weighted sum of the action values $q[s_t{=}6, a_t]$ at state six, where the weights are the policy probabilities $\pi[a_t \mid s_t{=}6]$ of taking that action.
Figure 19.9 Relationship between action values and state values. The value $q[s_t{=}6, a_t{=}2]$ of taking action two in state six is the reward $r[s_t{=}6, a_t{=}2]$ from taking that action plus a weighted sum of the discounted values $v[s_{t+1}]$ of being in successor states, where the weights are the transition probabilities $\Pr(s_{t+1} \mid s_t{=}6, a_t{=}2)$. The Bellman equations chain this relation with that of figure 19.8 to link the current and next (i) state values and (ii) action values.
$$v[s_t] = \sum_{a_t} \pi[a_t \mid s_t] \left( r[s_t, a_t] + \gamma \cdot \sum_{s_{t+1}} \Pr(s_{t+1} \mid s_t, a_t)\, v[s_{t+1}] \right). \qquad (19.9)$$
Similarly, substituting equation 19.7 into equation 19.8 provides a relation between the
action value at time t and t + 1:
$$q[s_t, a_t] = r[s_t, a_t] + \gamma \cdot \sum_{s_{t+1}} \Pr(s_{t+1} \mid s_t, a_t) \left( \sum_{a_{t+1}} \pi[a_{t+1} \mid s_{t+1}]\, q[s_{t+1}, a_{t+1}] \right). \qquad (19.10)$$
The latter two relations are the Bellman equations and are the backbone of many RL methods. In short, they say that the state (action) values have to be self-consistent. Consequently, when we update an estimate of one state (action) value, this will have a ripple effect that causes modifications to all the others.
19.3 Tabular reinforcement learning
Tabular RL algorithms (i.e., those that don't rely on function approximation) are divided into model-based and model-free methods. Model-based methods use the MDP structure explicitly and find the best policy from the transition matrix $\Pr(s_{t+1} \mid s_t, a_t)$ and reward structure $r[s, a]$. If these are known, this is a straightforward optimization problem that can be tackled using dynamic programming. If they are unknown, they must first be estimated from observed MDP trajectories.⁴
Conversely, model-free methods eschew a model of the MDP and fall into two classes:
1. Value estimation approaches estimate the optimal state-action value function and
then assign the policy according to the action in each state with the greatest value.
2. Policy estimation approaches directly estimate the optimal policy using a gradient
descent technique without the intermediate steps of estimating the model or values.
Within each family, Monte Carlo methods simulate many trajectories through the MDP for a given policy to gather information from which this policy can be improved. Sometimes it is not feasible or practical to simulate many trajectories before updating the policy. Temporal difference (TD) methods update the policy while the agent traverses the MDP.
We now briefly describe dynamic programming methods, Monte Carlo value estimation methods, and TD value estimation methods. Section 19.4 describes how deep networks have been used in TD value estimation methods. We return to policy estimation in section 19.5.
⁴In RL, a trajectory is an observed sequence of states, rewards, and actions. A rollout is a simulated trajectory. An episode is a trajectory that starts in an initial state and ends in a terminal state (e.g., a full game of chess starting from the standard opening position and ending in a win, lose, or draw).
Figure 19.10 Dynamic programming. a) The state values are initialized to zero, and the policy (arrows) is chosen randomly. b) The state values are updated to be consistent with their neighbors (equation 19.11, shown after two iterations). The policy is updated to move the agent to states with the highest value (equation 19.12). c) After several iterations, the algorithm converges to the optimal policy, in which the penguin tries to avoid the holes and reach the fish.
19.3.1 Dynamic programming
Dynamic programming algorithms assume we have perfect knowledge of the transition
and reward structure. In this respect, they are distinguished from most RL algorithms
which observe the agent interacting with the environment to gather information about
these quantities indirectly.
The state values v[s] are initialized arbitrarily (usually to zero). The deterministic
policy π[a|s] is also initialized (e.g., by choosing a random action for each state). The
algorithm then alternates between iteratively computing the state values for the current
policy (policy evaluation) and improving that policy (policy improvement).
Policy evaluation: We sweep through the states $s_t$, updating their values:

$$v[s_t] \leftarrow \sum_{a_t} \pi[a_t \mid s_t] \left( r[s_t, a_t] + \gamma \cdot \sum_{s_{t+1}} \Pr(s_{t+1} \mid s_t, a_t)\, v[s_{t+1}] \right), \qquad (19.11)$$

where $s_{t+1}$ is the successor state and $\Pr(s_{t+1} \mid s_t, a_t)$ is the state transition probability. Each update makes $v[s_t]$ consistent with the value at the successor state $s_{t+1}$ using the Bellman equation for state values (equation 19.9). This is termed bootstrapping.
Policy improvement: To update the policy, we greedily choose the action that maximizes the value for each state:

$$\pi[a_t \mid s_t] \leftarrow \operatorname*{argmax}_{a_t} \left[ r[s_t, a_t] + \gamma \cdot \sum_{s_{t+1}} \Pr(s_{t+1} \mid s_t, a_t)\, v[s_{t+1}] \right]. \qquad (19.12)$$
This is guaranteed to improve the policy according to the policy improvement theorem.
Figure 19.11 Monte Carlo methods. a) The policy (arrows) is initialized randomly. The MDP is repeatedly simulated, and the trajectories of these episodes are stored (orange and brown paths represent two trajectories). b) The action values are empirically estimated based on the observed returns averaged over these trajectories. In this case, the action values were all initially zero and have been updated where an action was observed. c) The policy can then be updated according to the action which received the best (or least bad) reward.
These two steps are iterated until the policy converges (figure 19.10).

Problems 19.2–19.3
Notebook 19.2 Dynamic programming

There are many variations of this approach. In policy iteration, the policy evaluation step is iterated until convergence before policy improvement. The values can be updated either in place or synchronously in each sweep. In value iteration, the policy evaluation procedure sweeps through the values just once before policy improvement. Asynchronous dynamic programming algorithms don't have to systematically sweep through all the values at each step but can update a subset of the states in place in an arbitrary order.
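As a concrete sketch, the evaluation and improvement steps above can be combined in a few lines of NumPy. The two-state, two-action MDP below is made up for illustration (it is not the penguin world from the figures), and a single evaluation sweep per improvement is used, as in value iteration:

```python
import numpy as np

def policy_iteration(P, r, gamma=0.9, n_sweeps=100):
    """Alternate policy evaluation (eq. 19.11) and improvement (eq. 19.12).

    P: transition probabilities, shape [S, A, S']; r: rewards, shape [S, A].
    The policy is deterministic, so the sum over actions in eq. 19.11
    collapses to the single chosen action. Returns values and policy.
    """
    S, A, _ = P.shape
    v = np.zeros(S)                 # state values, initialized to zero
    pi = np.zeros(S, dtype=int)     # arbitrary initial policy
    for _ in range(n_sweeps):
        # Policy evaluation: one synchronous sweep of eq. 19.11
        v = np.array([r[s, pi[s]] + gamma * P[s, pi[s]] @ v for s in range(S)])
        # Policy improvement: greedy action per state (eq. 19.12)
        pi = np.argmax(r + gamma * np.einsum('sat,t->sa', P, v), axis=1)
    return v, pi

# Toy MDP: each state has one rewarding action (all numbers illustrative)
P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.8, 0.2], [0.2, 0.8]]])
r = np.array([[0.0, 1.0], [1.0, 0.0]])
v, pi = policy_iteration(P, r)
```

With these numbers, the greedy policy settles on the rewarding action in each state, and both state values converge toward $1/(1-\gamma) = 10$.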
19.3.2 Monte Carlo methods
Unlike dynamic programming algorithms, Monte Carlo methods don’t assume knowledge
of the MDP’s transition probabilities and reward structure. Instead, they gain experience
by repeatedly sampling trajectories from the MDP and observing the rewards. They
alternate between computing the action values (based on this experience) and updating
the policy (based on the action values).
To estimate the action values $q[s, a]$, a series of episodes are run. Each starts with a given state and action and thereafter follows the current policy, producing a series of actions, states, and returns (figure 19.11a). The action value for a given state-action pair under the current policy is estimated as the average of the empirical returns that follow after each time this pair is observed (figure 19.11b). Then the policy is updated by choosing the action with the maximum value at every state (figure 19.11c):
$$\pi[a \mid s] \leftarrow \operatorname*{argmax}_{a} \big[ q[s, a] \big]. \qquad (19.13)$$
This is an on-policy method; the current best policy is used to guide the agent through the environment. This policy is based on the observed action values in every state, but of course, it's not possible to estimate the value of actions that haven't been used, and there is nothing to encourage the algorithm to explore these. One solution is to use exploring starts. Here, episodes with all possible state-action pairs are initiated, so every combination is observed at least once. However, this is impractical if the number of states is large or the starting point cannot be controlled.

Problem 19.4

A different approach is to use an epsilon-greedy policy, in which a random action is taken with probability $\epsilon$, and the optimal action is allotted the remaining probability. The choice of $\epsilon$ trades off exploitation and exploration. Here, an on-policy method will seek the best policy from this epsilon-greedy family, which will not generally be the best overall policy.
Conversely, in off-policy methods, the optimal policy $\pi$ (the target policy) is learned based on episodes generated by a different behavior policy $\pi'$. Typically, the target policy is deterministic, and the behavior policy is stochastic (e.g., an epsilon-greedy policy). Hence, the behavior policy can explore the environment, but the learned target policy remains efficient.

Notebook 19.3 Monte Carlo methods

Some off-policy methods explicitly use importance sampling (section 17.8.1) to estimate the action value under policy $\pi$ using samples from $\pi'$. Others, such as Q-learning (described in the next section), estimate the values based on the greedy action, even though this is not necessarily what was chosen.
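The Monte Carlo estimation step can be sketched as follows. The episode data, state indices, and action names below are invented for illustration; a full implementation would also generate the episodes with an exploring (e.g., epsilon-greedy) behavior policy:

```python
from collections import defaultdict

def mc_update(episodes, gamma=0.9):
    """Every-visit Monte Carlo action-value estimation (section 19.3.2).

    episodes: list of episodes, each a list of (state, action, reward)
    tuples, where the reward follows the action. Returns the action-value
    estimates and the greedy policy of eq. 19.13.
    """
    returns = defaultdict(list)
    for episode in episodes:
        g = 0.0
        # Walk backward so g accumulates the discounted return from each step
        for state, action, reward in reversed(episode):
            g = reward + gamma * g
            returns[(state, action)].append(g)
    q = {sa: sum(v) / len(v) for sa, v in returns.items()}
    # Greedy policy update: best observed action in each state (eq. 19.13)
    policy = {}
    for (s, a), value in q.items():
        if s not in policy or value > q[(s, policy[s])]:
            policy[s] = a
    return q, policy

episodes = [[(0, 'right', 0.0), (1, 'right', 1.0)],
            [(0, 'down', 0.0), (2, 'right', 0.5)]]
q, policy = mc_update(episodes)
```

Here the first episode's return discounts back to state 0, so $q[(0, \text{right})] = 0.9$ beats $q[(0, \text{down})] = 0.45$ and the greedy policy picks 'right'.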
19.3.3 Temporal difference methods
Dynamic programming methods use a bootstrapping process to update the values to make them self-consistent under the current policy. Monte Carlo methods sample the MDP to acquire information. Temporal difference (TD) methods combine both bootstrapping and sampling. However, unlike Monte Carlo methods, they update the values and policy while the agent traverses the states of the MDP instead of afterward.
SARSA (State-Action-Reward-State-Action) is an on-policy algorithm with update:
$$q[s_t, a_t] \leftarrow q[s_t, a_t] + \alpha \Big( r[s_t, a_t] + \gamma \cdot q[s_{t+1}, a_{t+1}] - q[s_t, a_t] \Big), \qquad (19.14)$$

where $\alpha \in \mathbb{R}^+$ is the learning rate. The bracketed term is called the TD error and measures the consistency between the estimated action value $q[s_t, a_t]$ and the estimate $r[s_t, a_t] + \gamma \cdot q[s_{t+1}, a_{t+1}]$ after taking a single step.
By contrast, Q-learning is an off-policy algorithm with update (figure 19.12):

$$q[s_t, a_t] \leftarrow q[s_t, a_t] + \alpha \Big( r[s_t, a_t] + \gamma \cdot \max_{a} q[s_{t+1}, a] - q[s_t, a_t] \Big), \qquad (19.15)$$

where now the choice of action at each step is derived from a different behavior policy $\pi'$.

Notebook 19.4 Temporal difference methods
Problem 19.5

In both cases, the policy is updated by taking the maximum of the action values at each state (equation 19.13). It can be shown that these updates are contraction mappings (see equation 16.20); the action values will eventually converge, assuming that every state-action pair is visited an infinite number of times.
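The Q-learning update (equation 19.15) and an epsilon-greedy behavior policy can be sketched in a few lines. The Q-table entries and the single transition below are made-up toy numbers:

```python
import random

def q_learning_step(q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One Q-learning update (eq. 19.15); q is a dict of dicts: q[s][a]."""
    td_target = r + gamma * max(q[s_next].values())  # greedy bootstrap target
    q[s][a] += alpha * (td_target - q[s][a])

def epsilon_greedy(q, s, epsilon=0.1):
    """Behavior policy: random action with probability epsilon, else greedy."""
    if random.random() < epsilon:
        return random.choice(list(q[s]))
    return max(q[s], key=q[s].get)

# Toy Q-table and one observed transition (illustrative numbers)
q = {0: {'up': 0.0, 'down': 1.0}, 1: {'up': 0.43, 'down': 0.2}}
q_learning_step(q, s=0, a='up', r=0.0, s_next=1)
# target = 0 + 0.9 * 0.43 = 0.387, so q[0]['up'] moves to 0.1 * 0.387
```

Note that the update bootstraps from the greedy action at $s_{t+1}$ regardless of which action the behavior policy actually takes next; that is what makes it off-policy.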
Figure 19.12 Q-learning. a) The agent starts in state $s_t$ and takes action $a_t = 2$ according to the policy. It does not slip on the ice and moves downward, receiving reward $r[s_t, a_t] = 0$ for leaving the original state. b) The maximum action value at the new state is found (here 0.43). c) The action value for action 2 in the original state is updated to 1.12 based on the current estimate of the maximum action value at the subsequent state, the reward, discount factor $\gamma = 0.9$, and learning rate $\alpha = 0.1$. This changes the highest action value at the original state, so the policy changes.
19.4 Fitted Q-learning
The tabular Monte Carlo and TD algorithms described above repeatedly traverse the entire MDP and update the action values. However, this is only practical if the state-action space is small. Unfortunately, this is rarely the case; even for the constrained environment of a chessboard, there are more than $10^{40}$ possible legal states.
In tted Q-learning, the discrete representation q[s
t
, a
t
] of the action values is replaced
by a machine learning model q[s
t
, a
t
, ϕ], where now the state is represented by a vector
s
t
rather than just an index. We then dene a least squares loss based on the consistency
of adjacent action values (similarly to in Q-learning, see equation 19.15):
L[ϕ] =
r[s
t
, a
t
] + γ · max
a
h
q[s
t+1
, a, ϕ]
i
q[s
t
, a
t
, ϕ]
2
, (19.16)
which in turn leads to the update:

$$\phi \leftarrow \phi + \alpha \Big( r[s_t, a_t] + \gamma \cdot \max_{a} \big[ q[s_{t+1}, a, \phi] \big] - q[s_t, a_t, \phi] \Big) \frac{\partial q[s_t, a_t, \phi]}{\partial \phi}. \qquad (19.17)$$
Fitted Q-learning differs from Q-learning in that convergence is no longer guaranteed. A change to the parameters potentially modifies both the target $r[s_t, a_t] + \gamma \cdot \max_{a_{t+1}} \big[ q[s_{t+1}, a_{t+1}, \phi] \big]$ (the maximum value may change) and the prediction $q[s_t, a_t, \phi]$. This can be shown both theoretically and empirically to damage convergence.
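A minimal sketch of the update in equation 19.17, with a linear model standing in for a deep network. The feature map and all numbers here are assumptions for illustration only:

```python
import numpy as np

def fitted_q_update(phi, feats, s, a, r, s_next, n_actions,
                    alpha=0.01, gamma=0.9):
    """One fitted Q-learning update (eq. 19.17) for a linear model
    q[s, a, phi] = phi[a] @ feats(s), with one weight vector per action.
    For this model, the derivative of q w.r.t. phi[a] is just feats(s)."""
    x = feats(s)
    q_next = np.array([phi[b] @ feats(s_next) for b in range(n_actions)])
    td_error = r + gamma * q_next.max() - phi[a] @ x
    phi[a] += alpha * td_error * x
    return td_error

feats = lambda s: np.array([1.0, float(s)])   # hand-made toy features
phi = np.zeros((2, 2))                        # 2 actions, 2 features each
delta = fitted_q_update(phi, feats, s=0, a=1, r=1.0, s_next=1, n_actions=2)
```

Note that both `td_error` (through `q_next`) and the prediction depend on `phi`, which is exactly the coupling that the text says damages convergence.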
Figure 19.13 Atari Benchmark. The Atari benchmark consists of 49 Atari 2600
games, including Breakout (pictured), Pong, and various shoot-em-up, platform,
and other types of games. a-d) Even for games with a single screen, the state
is not fully observable from a single frame because the velocity of the objects is
unknown. Consequently, it is usual to use several adjacent frames (here, four)
to represent the state. e) The action simulates the user input via a joystick. f)
There are eighteen actions corresponding to eight directions of movement or no
movement, and for each of these nine cases, the button being pressed or not.
19.4.1 Deep Q-networks for playing ATARI games
Deep networks are ideally suited to making predictions from a high-dimensional state space, so they are a natural choice for the model in fitted Q-learning. In principle, they could take both state and action as input and predict the values, but in practice, the network takes only the state and simultaneously predicts the values for each action.

The Deep Q-Network was a breakthrough reinforcement learning architecture that exploited deep networks to learn to play ATARI 2600 games. The observed data comprises 220×160 images with 128 possible colors at each pixel (figure 19.13). This was reshaped to size 84×84, and only the brightness value was retained. Unfortunately, the full state is not observable from a single frame. For example, the velocity of game objects is unknown. To help resolve this problem, the network ingests the last four frames at each time step to form $s_t$. It maps these frames through three convolutional layers followed by a fully connected layer to predict the value of every action (figure 19.14).
Several modifications were made to the standard training procedure. First, the rewards (which were driven by the score in the game) were clipped to −1 for a negative change and +1 for a positive change. This compensates for the wide variation in scores between different games and allows the same learning rate to be used. Second, the system exploited experience replay. Rather than update the network based on the tuple $\langle s_t, a_t, r_{t+1}, s_{t+1} \rangle$ at the current step or with a batch of the last $I$ tuples, all recent
Figure 19.14 Deep Q-network architecture. The input $s_t$ consists of four adjacent frames of the ATARI game. Each is resized to 84×84 and converted to grayscale. These frames are represented as four channels and processed by an 8×8 convolution with stride four, followed by a 4×4 convolution with stride 2, followed by two fully connected layers. The final output predicts the action value $q[s_t, a_t]$ for each of the 18 actions in this state.
tuples were stored in a buer. This buer was sampled randomly to generate a batch
at each step. This approach reuses data samples many times and reduces correlations
between the samples in the batch that arise due to the similarity of adjacent frames.
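The replay mechanism can be sketched as a fixed-capacity buffer with uniform random sampling. The capacity and the toy transitions below are illustrative, not the original DQN settings:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity experience replay buffer (a sketch of the idea)."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest tuples fall off first

    def add(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Uniform random sampling reuses old data and breaks the
        # correlation between adjacent frames within a training batch
        return random.sample(self.buffer, batch_size)

buf = ReplayBuffer(capacity=1000)
for t in range(50):                  # fake transitions for illustration
    buf.add(s=t, a=t % 4, r=0.0, s_next=t + 1)
batch = buf.sample(8)
```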
Finally, the issue of convergence in fitted Q-networks was tackled by fixing the target parameters to values $\phi^-$ and only updating them periodically. This gives the update:
$$\phi \leftarrow \phi + \alpha \Big( r[s_t, a_t] + \gamma \cdot \max_{a} \big[ q[s_{t+1}, a, \phi^-] \big] - q[s_t, a_t, \phi] \Big) \frac{\partial q[s_t, a_t, \phi]}{\partial \phi}. \qquad (19.18)$$
Now the network no longer chases a moving target and is less prone to oscillation.

Using these and other heuristics and with an $\epsilon$-greedy policy, Deep Q-Networks performed at a level comparable to a professional game tester across a set of 49 games using the same network (trained separately for each game). It should be noted that the training process was data-intensive. It took around 38 full days of experience to learn each game. In some games, the algorithm exceeded human performance. On other games like "Montezuma's Revenge," it barely made any progress. This game features sparse rewards and multiple screens with quite different appearances.
19.4.2 Double Q-learning and double deep Q-networks
One potential flaw of Q-learning is that the maximization over the actions in the update:

$$q[s_t, a_t] \leftarrow q[s_t, a_t] + \alpha \Big( r[s_t, a_t] + \gamma \cdot \max_{a} q[s_{t+1}, a] - q[s_t, a_t] \Big) \qquad (19.19)$$

leads to a systematic bias in the estimated action values $q[s_t, a_t]$. Consider two actions that provide the same average reward, but one is stochastic and the other deterministic. The stochastic reward will exceed the average roughly half of the time and be chosen by the maximum operation, causing the corresponding action value $q[s_t, a_t]$ to be overestimated. A similar argument can be made about random inaccuracies in the output of the network $q[s_t, a_t, \phi]$ or random initializations of the q-function.
The underlying problem is that the same network both selects the target (by the maximization operation) and updates the value. Double Q-learning tackles this problem by training two models $q_1[s_t, a_t, \pi_1]$ and $q_2[s_t, a_t, \pi_2]$ simultaneously:

$$\begin{aligned}
q_1[s_t, a_t] &\leftarrow q_1[s_t, a_t] + \alpha \Big( r[s_t, a_t] + \gamma \cdot q_2\Big[ s_{t+1}, \operatorname*{argmax}_{a} \big[ q_1[s_{t+1}, a] \big] \Big] - q_1[s_t, a_t] \Big) \\
q_2[s_t, a_t] &\leftarrow q_2[s_t, a_t] + \alpha \Big( r[s_t, a_t] + \gamma \cdot q_1\Big[ s_{t+1}, \operatorname*{argmax}_{a} \big[ q_2[s_{t+1}, a] \big] \Big] - q_2[s_t, a_t] \Big).
\end{aligned} \qquad (19.20)$$
Now the choice of the target and the target itself are decoupled, which helps prevent these biases. In practice, new tuples $\langle s, a, r, s' \rangle$ are randomly assigned to update one model or another. This is known as double Q-learning. Double deep Q-networks or double DQNs use deep networks $q[s_t, a_t, \phi_1]$ and $q[s_t, a_t, \phi_2]$ to estimate the action values, and the update becomes:
$$\begin{aligned}
\phi_1 &\leftarrow \phi_1 + \alpha \Big( r[s_t, a_t] + \gamma \cdot q\Big[ s_{t+1}, \operatorname*{argmax}_{a} \big[ q[s_{t+1}, a, \phi_1] \big], \phi_2 \Big] - q[s_t, a_t, \phi_1] \Big) \frac{\partial q[s_t, a_t, \phi_1]}{\partial \phi_1} \\
\phi_2 &\leftarrow \phi_2 + \alpha \Big( r[s_t, a_t] + \gamma \cdot q\Big[ s_{t+1}, \operatorname*{argmax}_{a} \big[ q[s_{t+1}, a, \phi_2] \big], \phi_1 \Big] - q[s_t, a_t, \phi_2] \Big) \frac{\partial q[s_t, a_t, \phi_2]}{\partial \phi_2}.
\end{aligned} \qquad (19.21)$$
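The tabular double Q-learning updates (equation 19.20) can be sketched as follows; the Q-table contents and the single transition are made-up numbers:

```python
import random

def double_q_step(q1, q2, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One tabular double Q-learning update (eq. 19.20). Each new tuple is
    randomly assigned to update one table, using the other for the target."""
    if random.random() < 0.5:
        q1, q2 = q2, q1                # swap roles: update the other table
    a_star = max(q1[s_next], key=q1[s_next].get)  # selection by one table...
    target = r + gamma * q2[s_next][a_star]       # ...evaluation by the other
    q1[s][a] += alpha * (target - q1[s][a])

q1 = {0: {'l': 0.0, 'r': 0.0}, 1: {'l': 0.5, 'r': 0.3}}
q2 = {0: {'l': 0.0, 'r': 0.0}, 1: {'l': 0.2, 'r': 0.6}}
double_q_step(q1, q2, s=0, a='r', r=1.0, s_next=1)
```

Whichever table is updated, the action is selected by one table's argmax and evaluated by the other, which is the decoupling that suppresses the overestimation bias.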
19.5 Policy gradient methods
Q-learning estimates the action values first and then uses these to update the policy. Conversely, policy-based methods directly learn a stochastic policy $\pi[a_t \mid s_t, \theta]$. This is a function with trainable parameters $\theta$ that maps a state $s_t$ to a distribution $\Pr(a_t \mid s_t)$ over actions $a_t$ from which we can sample. In MDPs, there is always an optimal deterministic policy. However, there are three reasons to use a stochastic policy:

1. A stochastic policy naturally helps with exploration of the space; we are not obliged to take the best action at each time step.
2. The loss changes smoothly as we modify a stochastic policy. This means we can use gradient descent methods even though the rewards are discrete. This is similar to using maximum likelihood in (discrete) classification problems. The loss changes smoothly as the model parameters change to make the true class more likely.
3. The MDP assumption is often incorrect; we usually don't have complete knowledge of the state. For example, consider an agent navigating in an environment where it can only observe nearby locations (e.g., figure 19.4). If two locations look identical, but the nearby reward structure is different, a stochastic policy allows the possibility of taking different actions until this ambiguity is resolved.
19.5.1 Derivation of gradient update
Consider a trajectory $\tau = [s_1, a_1, s_2, a_2, \ldots, s_T, a_T]$ through an MDP. The probability of this trajectory $\Pr(\tau \mid \theta)$ depends on both the state evolution function $\Pr(s_{t+1} \mid s_t, a_t)$ and the current stochastic policy $\pi[a_t \mid s_t, \theta]$:

$$\Pr(\tau \mid \theta) = \Pr(s_1) \prod_{t=1}^{T} \pi[a_t \mid s_t, \theta]\, \Pr(s_{t+1} \mid s_t, a_t). \qquad (19.22)$$
Policy gradient algorithms aim to maximize the expected return $r[\tau]$ over many such trajectories:

$$\theta = \operatorname*{argmax}_{\theta} \Big[ \mathbb{E}_{\tau}\big[ r[\tau] \big] \Big] = \operatorname*{argmax}_{\theta} \left[ \int \Pr(\tau \mid \theta)\, r[\tau]\, d\tau \right], \qquad (19.23)$$
where the return is the sum of all the rewards received along the trajectory.

To maximize this quantity, we use the gradient ascent update:

$$\theta \leftarrow \theta + \alpha \cdot \frac{\partial}{\partial \theta} \int \Pr(\tau \mid \theta)\, r[\tau]\, d\tau = \theta + \alpha \cdot \int \frac{\partial \Pr(\tau \mid \theta)}{\partial \theta}\, r[\tau]\, d\tau, \qquad (19.24)$$

where $\alpha$ is the learning rate.
We want to approximate this integral with a sum over empirically observed trajectories. These are drawn from the distribution $\Pr(\tau \mid \theta)$, so to make progress, we multiply and divide the integrand by this distribution:

$$\begin{aligned}
\theta &\leftarrow \theta + \alpha \cdot \int \frac{\partial \Pr(\tau \mid \theta)}{\partial \theta}\, r[\tau]\, d\tau \\
&= \theta + \alpha \cdot \int \Pr(\tau \mid \theta) \frac{1}{\Pr(\tau \mid \theta)} \frac{\partial \Pr(\tau \mid \theta)}{\partial \theta}\, r[\tau]\, d\tau \\
&\approx \theta + \alpha \cdot \frac{1}{I} \sum_{i=1}^{I} \frac{1}{\Pr(\tau_i \mid \theta)} \frac{\partial \Pr(\tau_i \mid \theta)}{\partial \theta}\, r[\tau_i].
\end{aligned} \qquad (19.25)$$
This equation has a simple interpretation (figure 19.15); the update changes the parameters $\theta$ to increase the likelihood $\Pr(\tau_i \mid \theta)$ of an observed trajectory $\tau_i$ in proportion to the reward $r[\tau_i]$ from that trajectory. However, it also normalizes by the probability of observing that trajectory in the first place to compensate for the fact that some trajectories are observed more often than others. If a trajectory is already common and yields high rewards, then we don't need to change much. The biggest updates will come from trajectories that are uncommon but create large rewards.

We can simplify this expression using the likelihood ratio identity:
Figure 19.15 Policy gradients. Five episodes for the same policy (brighter indicates higher reward). Trajectories 1, 2, and 3 generate consistently high rewards, but similar trajectories already frequently occur with this policy, so there is no need to change. Conversely, trajectory 4 receives low rewards, so the policy should be modified to avoid producing similar trajectories. Trajectory 5 receives high rewards and is unusual. This will cause the largest change to the policy under equation 19.25.
$$\frac{\partial \log\big[ f[z] \big]}{\partial z} = \frac{1}{f[z]} \frac{\partial f[z]}{\partial z}, \qquad (19.26)$$
which yields the update:

$$\theta \leftarrow \theta + \alpha \cdot \frac{1}{I} \sum_{i=1}^{I} \frac{\partial \log\big[ \Pr(\tau_i \mid \theta) \big]}{\partial \theta}\, r[\tau_i]. \qquad (19.27)$$
The log probability $\log[\Pr(\tau \mid \theta)]$ of a trajectory is given by:

$$\begin{aligned}
\log[\Pr(\tau \mid \theta)] &= \log\Big[ \Pr(s_1) \prod_{t=1}^{T} \pi[a_t \mid s_t, \theta]\, \Pr(s_{t+1} \mid s_t, a_t) \Big] \qquad (19.28) \\
&= \log\big[ \Pr(s_1) \big] + \sum_{t=1}^{T} \log\big[ \pi[a_t \mid s_t, \theta] \big] + \sum_{t=1}^{T} \log\big[ \Pr(s_{t+1} \mid s_t, a_t) \big],
\end{aligned}$$
and noting that only the center term depends on $\theta$, we can rewrite the update from equation 19.27 as:

$$\theta \leftarrow \theta + \alpha \cdot \frac{1}{I} \sum_{i=1}^{I} \sum_{t=1}^{T} \frac{\partial \log\big[ \pi[a_{it} \mid s_{it}, \theta] \big]}{\partial \theta}\, r[\tau_i], \qquad (19.29)$$
where $s_{it}$ is the state at time $t$ in episode $i$, and $a_{it}$ is the action taken at time $t$ in episode $i$. Note that since the terms relating to the state evolution $\Pr(s_{t+1} \mid s_t, a_t)$ disappear, this parameter update does not assume a Markov time evolution process.
We can further simplify this by noting that:

$$r[\tau_i] = \sum_{t=1}^{T} r_{it} = \sum_{k=1}^{t-1} r_{ik} + \sum_{k=t}^{T} r_{ik}, \qquad (19.30)$$
where $r_{it}$ is the reward at time $t$ in the $i^{th}$ episode. The first term (the rewards before time $t$) does not affect the update from time $t$, so we can write:

$$\theta \leftarrow \theta + \alpha \cdot \frac{1}{I} \sum_{i=1}^{I} \sum_{t=1}^{T} \frac{\partial \log\big[ \pi[a_{it} \mid s_{it}, \theta] \big]}{\partial \theta} \sum_{k=t}^{T} r_{ik}. \qquad (19.31)$$
19.5.2 REINFORCE algorithm
REINFORCE is an early policy gradient algorithm that exploits this result and incorporates discounting. It is a Monte Carlo method that generates episodes $\tau_i = [s_{i1}, a_{i1}, r_{i2}, s_{i2}, a_{i2}, r_{i3}, \ldots, r_{iT}]$ based on the current policy $\pi[a \mid s, \theta]$. For discrete actions, this policy could be determined by a neural network $\pi[s \mid \theta]$, which takes the current state $s$ and returns one output for each possible action. These outputs are passed through a softmax function to create a distribution over actions, which is sampled at each time step.
For each episode $i$, we loop through each step $t$ and calculate the empirical discounted return for the partial trajectory $\tau_{it}$ that starts at time $t$:

$$r[\tau_{it}] = \sum_{k=t+1}^{T} \gamma^{k-t-1}\, r_{ik}, \qquad (19.32)$$
and then we update the parameters for each time step $t$ in each trajectory:

$$\theta \leftarrow \theta + \alpha \cdot \gamma^{t}\, \frac{\partial \log\big[ \pi_{a_{it}}[s_{it}, \theta] \big]}{\partial \theta}\, r[\tau_{it}] \qquad \forall\, i, t, \qquad (19.33)$$

where $\pi_{a_t}[s_t, \theta]$ is the probability of $a_t$ produced by the neural network given the current state $s_t$ and parameters $\theta$, and $\alpha$ is the learning rate. The extra term $\gamma^{t}$ ensures that the rewards are discounted relative to the start of the sequence because we maximize the log probability of returns in the whole sequence (equation 19.23).
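The REINFORCE update can be sketched for a tabular softmax policy, where the gradient of the log probability has the closed form one-hot($a$) minus the policy probabilities. The episode below (state, action, reward-after-action) is invented for illustration:

```python
import numpy as np

def reinforce_update(theta, episode, alpha=0.01, gamma=0.9):
    """One REINFORCE update (eqs. 19.32-19.33) for a tabular softmax policy.

    theta: logits of shape [n_states, n_actions]; episode: list of
    (state, action, reward) tuples with reward received after the action.
    """
    rewards = [r for _, _, r in episode]
    for t, (s, a, _) in enumerate(episode):
        # Discounted return of the partial trajectory from time t (eq. 19.32)
        g = sum(gamma ** (k - t) * rewards[k] for k in range(t, len(rewards)))
        # Gradient of log softmax: one-hot(a) minus policy probabilities
        probs = np.exp(theta[s]) / np.sum(np.exp(theta[s]))
        grad_log_pi = -probs
        grad_log_pi[a] += 1.0
        theta[s] += alpha * (gamma ** t) * grad_log_pi * g   # eq. 19.33
    return theta

theta = np.zeros((3, 2))                        # 3 states, 2 actions
episode = [(0, 1, 0.0), (1, 1, 0.0), (2, 0, 1.0)]
theta = reinforce_update(theta, episode)
```

After one update, the logits of the actions that led to the final reward increase, so sampling from the softmax becomes more likely to reproduce them.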
19.5.3 Baselines
Policy gradient methods have the drawback that they exhibit high variance; many episodes may be needed to get stable updates of the derivatives. One way to reduce this variance is to subtract a baseline $b$ from the trajectory returns $r[\tau]$:

$$\theta \leftarrow \theta + \alpha \cdot \frac{1}{I} \sum_{i=1}^{I} \sum_{t=1}^{T} \frac{\partial \log\big[ \pi_{a_{it}}[s_{it}, \theta] \big]}{\partial \theta}\, \big( r[\tau_{it}] - b \big). \qquad (19.34)$$
As long as the baseline $b$ doesn't depend on the actions:

Problem 19.6
Figure 19.16 Decreasing variance of estimates using control variates. a) Consider trying to estimate $\mathbb{E}[a]$ from a small number of samples. The estimate (the mean of the samples) will vary based on the number of samples and the variance of those samples. b) Now consider observing another variable $b$ that co-varies with $a$ and has $\mathbb{E}[b] = 0$ and the same variance as $a$. c) The variance of the samples of $a - b$ is much less than that of $a$, but the expected value $\mathbb{E}[a - b] = \mathbb{E}[a]$, so we get an estimator with lower variance.
$$\mathbb{E}_{\tau}\left[ \sum_{t=1}^{T} \frac{\partial \log\big[ \pi_{a_{it}}[s_{it}, \theta] \big]}{\partial \theta} \cdot b \right] = 0, \qquad (19.35)$$
and the expected value will not change. However, if the baseline co-varies with irrelevant factors that add uncertainty, then subtracting it reduces the variance (figure 19.16). This is a special case of the method of control variates (see problem 19.7).

Notebook 19.5 Control variates
Problem 19.7
This raises the question of how we should choose $b$. We can find the value of $b$ that minimizes the variance by writing an expression for the variance, taking the derivative with respect to $b$, setting the result to zero, and solving to yield:

Problem 19.8

$$b = \frac{\sum_i \Big( \sum_{t=1}^{T} \partial \log\big[ \pi_{a_{it}}[s_{it}, \theta] \big] / \partial \theta \Big)^2 r[\tau_{it}]}{\sum_i \Big( \sum_{t=1}^{T} \partial \log\big[ \pi_{a_{it}}[s_{it}, \theta] \big] / \partial \theta \Big)^2}. \qquad (19.36)$$
In practice, this is often approximated as:

$$b = \frac{1}{I} \sum_{i} r[\tau_i]. \qquad (19.37)$$
Subtracting this baseline factors out variance that might occur when the returns $r[\tau_i]$ from all trajectories are greater than is typical but only because they happen to pass through states with higher than average returns whatever actions are taken.
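The variance-reduction argument can be checked with a small numerical control-variate experiment in the spirit of figure 19.16. The distributions below are made up for illustration:

```python
import numpy as np

# Control variates: subtracting a zero-mean variable b that co-varies
# with a reduces the variance of the estimate without biasing it.
rng = np.random.default_rng(0)
shared = rng.normal(size=100_000)                     # noise shared by a and b
a = 5.0 + shared + 0.3 * rng.normal(size=100_000)     # E[a] = 5
b = shared                                            # E[b] = 0, co-varies with a
baseline_estimate = (a - b).mean()                    # still estimates E[a]
variance_ratio = (a - b).var() / a.var()              # much less than one
```

Most of the variance of $a$ comes from the shared noise, so subtracting $b$ removes it while leaving the mean unchanged.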
19.5.4 State-dependent baselines
A better option is to use a baseline $b[s_{it}]$ that depends on the current state $s_{it}$:
$$\theta \leftarrow \theta + \alpha \cdot \frac{1}{I} \sum_{i=1}^{I} \sum_{t=1}^{T} \frac{\partial \log\big[ \pi_{a_{it}}[s_{it}, \theta] \big]}{\partial \theta}\, \big( r[\tau_{it}] - b[s_{it}] \big). \qquad (19.38)$$
Here, we are compensating for variance introduced by some states having greater overall returns than others, whichever actions we take.

A sensible choice is the expected future reward based on the current state, which is just the state value $v[s]$. In this case, the difference between the empirically observed rewards and the baseline is known as the advantage estimate. Since we are in a Monte Carlo context, this can be parameterized by a neural network $b[s] = v[s, \phi]$ with parameters $\phi$, which we can fit to the observed returns using a least squares loss:
$$L[\phi] = \sum_{i=1}^{I} \sum_{t=1}^{T} \left( v[s_{it}, \phi] - \sum_{j=t}^{T} r_{ij} \right)^2. \qquad (19.39)$$
19.6 Actor-critic methods
Actor-critic algorithms are temporal difference (TD) policy gradient algorithms. They can update the parameters of the policy network at each step. This contrasts with the Monte Carlo REINFORCE algorithm, which must wait for one or more episodes to complete before updating the parameters.
In the TD approach, we do not have access to the future rewards $r[\tau_t] = \sum_{k=t}^{T} r_k$ along this trajectory. Actor-critic algorithms approximate the sum over all the future rewards with the observed current reward plus the discounted value of the next state:

$$r[\tau_{it}] \approx r_{it} + \gamma \cdot v[s_{i,t+1}, \phi]. \qquad (19.40)$$
Here the value $v[s_{i,t+1}, \phi]$ is estimated by a second neural network with parameters $\phi$. Substituting this into equation 19.38 gives the update:

$$\theta \leftarrow \theta + \alpha \cdot \frac{1}{I} \sum_{i=1}^{I} \sum_{t=1}^{T} \frac{\partial \log\big[ \Pr(a_{it} \mid s_{it}, \theta) \big]}{\partial \theta}\, \big( r_{it} + \gamma \cdot v[s_{i,t+1}, \phi] - v[s_{i,t}, \phi] \big). \qquad (19.41)$$
Concurrently, we update the parameters $\phi$ by bootstrapping using the loss function:

$$L[\phi] = \sum_{i=1}^{I} \sum_{t=1}^{T} \big( r_{it} + \gamma \cdot v[s_{i,t+1}, \phi] - v[s_{i,t}, \phi] \big)^2. \qquad (19.42)$$
The policy network $\pi[s_t, \theta]$ that predicts $\Pr(a \mid s_t)$ is termed the actor. The value network $v[s_t, \phi]$ is termed the critic. Often the same network represents both actor and
Figure 19.17 Decision transformer. The decision transformer treats offline reinforcement learning as a sequence prediction task. The input is a sequence of states, actions, and returns-to-go (remaining rewards in the episode), each of which is mapped to a fixed-size embedding. At each time step, the network predicts the next action. During testing, the returns-to-go are unknown; in practice, an initial estimate is made from which subsequent observed rewards are subtracted.
the critic, with two sets of outputs that predict the policy and the values, respectively.
Note that although actor-critic methods can update the policy parameters at each step,
this is rarely done in practice. The agent typically collects a batch of experience over
many time steps before the policy is updated.
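A single actor-critic step (equations 19.41 and 19.42) can be sketched with a tabular softmax actor and a tabular critic. The separate critic learning rate `beta` and all the numbers below are assumptions for the sketch:

```python
import numpy as np

def actor_critic_step(theta, v, s, a, r, s_next,
                      alpha=0.01, beta=0.1, gamma=0.9):
    """One actor-critic update (eqs. 19.41-19.42), tabular actor and critic."""
    td_error = r + gamma * v[s_next] - v[s]      # TD advantage estimate
    # Gradient of log softmax policy: one-hot(a) minus policy probabilities
    probs = np.exp(theta[s]) / np.sum(np.exp(theta[s]))
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    theta[s] += alpha * grad_log_pi * td_error   # actor (policy) update
    v[s] += beta * td_error                      # critic bootstrap update
    return td_error

theta = np.zeros((2, 2))     # 2 states, 2 actions
v = np.zeros(2)              # critic values
delta = actor_critic_step(theta, v, s=0, a=1, r=1.0, s_next=1)
```

Because the TD error replaces the full return, the update can in principle be applied at every step of an episode rather than only at its end.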
19.7 Offline reinforcement learning

Interaction with the environment is at the core of reinforcement learning. However, there are some scenarios where it is not practical to send a naïve agent into an environment to explore the effect of different actions. This may be because erratic behavior in the environment is dangerous (e.g., driving autonomous vehicles) or because data collection is time-consuming or expensive (e.g., making financial trades).
However, it is possible to gather historical data from human agents in both cases. Offline RL or batch RL aims to learn how to take actions that maximize rewards on future episodes by observing past sequences $s_1, a_1, r_2, s_2, a_2, r_3, \ldots$, without ever interacting with the environment. It is distinct from imitation learning, a related technique that (i) does not have access to the rewards and (ii) attempts to replicate the performance of a historical agent rather than improve it.
Although there are oine RL methods based on Q-Learning and policy gradients,
This work is subject to a Creative Commons CC-BY-NC-ND license. (C) MIT Press.
this paradigm opens up new possibilities. In particular, we can treat this as a sequence learning problem, in which the goal is to predict the next action, given the history of states, rewards, and actions. The decision transformer exploits a transformer decoder framework (section 12.7) to make these predictions (figure 19.17).
However, the goal is to predict actions based on future rewards, and these are not captured in a standard s, a, r sequence. Hence, the decision transformer replaces the reward $r_t$ with the returns-to-go $R_{t:T} = \sum_{t'=t}^{T} r_{t'}$ (i.e., the sum of the empirically observed future rewards). The remaining framework is very similar to a standard transformer decoder. The states, actions, and returns-to-go are converted to fixed-size embeddings via learned mappings. For Atari games, the state embedding might be converted via a convolutional network similar to that in figure 19.14. The embeddings for the actions and returns-to-go can be learned in the same way as word embeddings (figure 12.9). The transformer is trained with masked self-attention and position embeddings.
This formulation is natural during training but poses a quandary during inference because we don't know the returns-to-go. This can be resolved by using the desired total return at the first step and decrementing this as rewards are received. For example, in an Atari game, the desired total return would be the total score required to win.
Decision transformers can also be fine-tuned from online experience and hence learn over time. They have the advantage of dispensing with most of the reinforcement learning machinery and its associated instability and replacing this with standard supervised learning. Transformers can learn from enormous quantities of data and integrate information across large time contexts (making the temporal credit assignment problem more tractable). This represents an intriguing new direction for reinforcement learning.
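The returns-to-go that condition the model can be computed with one backward pass over an episode's rewards, and approximated at test time by decrementing a desired total return. The helper below is an illustrative sketch with our own names, not code from the decision transformer paper:

```python
# Sketch of the returns-to-go R_{t:T} used by the decision transformer.
# The helper name and test-time bookkeeping are our own illustrations.

def returns_to_go(rewards):
    """R_t = r_t + r_{t+1} + ... + r_T for every time step t."""
    rtg, running = [], 0.0
    for r in reversed(rewards):
        running += r
        rtg.append(running)
    return rtg[::-1]

# Training time: computed from the empirically observed rewards.
print(returns_to_go([1.0, 0.0, 2.0, 1.0]))  # [4.0, 3.0, 3.0, 1.0]

# Test time: start from a desired total return, subtract rewards as observed.
conditioning = [4.0]
for r in [1.0, 0.0]:                        # rewards received so far
    conditioning.append(conditioning[-1] - r)
print(conditioning)                         # [4.0, 3.0, 3.0]
```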
19.8 Summary
Reinforcement learning is a sequential decision-making framework for Markov decision processes and similar systems. This chapter reviewed tabular approaches to RL, including dynamic programming (in which the environment model is known), Monte Carlo methods (in which multiple episodes are run and the action values and policy subsequently changed based on the rewards received), and temporal difference methods (in which these values are updated while the episode is ongoing).
Deep Q-Learning is a temporal difference method where deep neural networks are used to predict the action value for every state. It can train agents to perform Atari 2600 games at a level similar to humans. Policy gradient methods directly optimize the policy rather than assigning values to actions. They produce stochastic policies, which are important when the environment is partially observable. The updates are noisy, and many refinements have been introduced to reduce their variance.
Oine reinforcement learning is used when we cannot interact with the environment
but must learn from historical data. The decision transformer leverages recent progress
in deep learning to build a model of the state-action-reward sequence and predict the
actions that will maximize the rewards.
Notes
Sutton & Barto (2018) cover tabular reinforcement learning methods in depth. Li (2017),
Arulkumaran et al. (2017), François-Lavet et al. (2018), and Wang et al. (2022c) all provide
overviews of deep reinforcement learning. Graesser & Keng (2019) is an excellent introductory
resource that includes Python code.
Landmarks in deep reinforcement learning: Most landmark achievements of reinforcement learning have been in either video games or real-world games since these provide constrained environments with limited actions and fixed rules. Deep Q-Learning (Mnih et al., 2015) achieved human-level performance across a benchmark of ATARI games. AlphaGo (Silver et al., 2016) beat the world champion at Go. This game was previously considered very difficult for computers to play. Berner et al. (2019) built a system that beat the world champion team in the five vs. five-player game Defense of the Ancients 2, which requires cooperation across players. Ye et al. (2021) built a system that could beat humans on Atari games with limited data (in contrast to previous systems, which need much more experience than humans). More recently, the Cicero system demonstrated human-level performance in the game Diplomacy, which requires natural language negotiations and coordination between players (FAIR, 2022).
RL has also been applied successfully to combinatorial optimization problems (see Mazyavkina
et al., 2021). For example, Kool et al. (2019) learned a model that performed similarly to the
best heuristics for the traveling salesman problem. Recently, AlphaTensor (Fawzi et al., 2022)
treated matrix multiplication as a game and learned faster ways to multiply matrices using fewer
multiplication operations. Since deep learning relies heavily on matrix multiplication, this is
one of the rst examples of self-improvement in AI.
Classical reinforcement learning methods: Very early contributions to the theory of MDPs
were made by Thompson (1933) and Thompson (1935). The Bellman recursions were introduced
by Bellman (1966). Howard (1960) introduced policy iteration. Sutton & Barto (2018) identify
the work of Andreae (1969) as being the rst to describe RL using the MDP formalism.
The modern era of reinforcement learning arguably originated in the Ph.D. theses of Sutton (1984) and Watkins (1989). Sutton (1988) introduced the term temporal difference learning. Watkins (1989) and Watkins & Dayan (1992) introduced Q-Learning and showed that it converges to a fixed point by Banach's theorem because the Bellman operator is a contraction mapping.
Watkins (1989) made the rst explicit connection between dynamic programming
and reinforcement learning. SARSA was developed by Rummery & Niranjan (1994). Gordon (1995) introduced fitted Q-learning, in which a machine learning model is used to predict the action value for each state-action pair. Riedmiller (2005) introduced neural-fitted Q-learning, which used a neural network to predict all the action values at once from a state. Early work
on Monte Carlo methods was carried out by Singh & Sutton (1996), and the exploring starts
algorithm was introduced by Sutton & Barto (1999). Note that this is an extremely cursory
summary of more than fty years of work. A much more thorough treatment can be found in
Sutton & Barto (2018).
Deep Q-Networks: Deep Q-Learning was devised by Mnih et al. (2015) and is an intellectual descendent of neural-fitted Q-learning. It exploited the then-recent successes of convolutional networks to develop a fitted Q-Learning method that could achieve human-level performance on a benchmark of ATARI games. Deep Q-Learning suffers from the deadly triad issue (Sutton & Barto, 2018): training can be unstable in any scheme that incorporates (i) bootstrapping, (ii) off-policy learning, and (iii) function approximation. Much subsequent work has aimed to make training more stable. Mnih et al. (2015) introduced the experience replay buffer (Lin, 1992), which was subsequently improved by Schaul et al. (2016) to favor more important tuples and hence increase learning speed. This is termed prioritized experience replay.
The original Q-Learning paper concatenated four frames so the network could observe the
velocities of objects and make the underlying process closer to fully observable. Hausknecht &
Stone (2015) introduced deep recurrent Q-learning, which used a recurrent network architecture
that only ingested a single frame at a time because it could “remember” the previous states.
Van Hasselt (2010) identied the systematic overestimation of the state values due to the max
operation and proposed double Q-Learning in which two models are trained simultaneously to
remedy this. This was subsequently applied in the context of deep Q-learning (Van Hasselt
et al., 2016), although its ecacy has since been questioned (Hessel et al., 2018). Wang et al.
(2016) introduced deep dueling networks in which two heads of the same network predict (i)
the state value and (ii) the advantage (relative value) of each action. The intuition here is that
sometimes it is the state value that is important, and it doesn’t matter much which action is
taken, and decoupling these estimates improves stability.
Fortunato et al. (2018) introduced noisy deep Q-Networks, in which some weights in the Q-Network are multiplied by noise to add stochasticity to the predictions and encourage exploration. The network can learn to decrease the magnitudes of the noise over time as it converges to a sensible policy. Distributional DQN (Bellemare et al., 2017a; Dabney et al., 2018, following Morimura et al., 2010) aims to estimate more complete information about the distribution of returns than just the expectation. This potentially allows the network to mitigate against worst-case outcomes and can also improve performance, as predicting higher moments provides a richer training signal. Rainbow (Hessel et al., 2018) combined six improvements to the original deep Q-learning algorithm, including dueling networks, distributional DQN, and noisy DQN, to improve both the training speed and the final performance on the ATARI benchmark.
Policy gradients: Williams (1992) introduced the REINFORCE algorithm. The term “policy
gradient method” dates to Sutton et al. (1999). Konda & Tsitsiklis (1999) introduced the actor-
critic algorithm. Decreasing the variance by using different baselines is discussed in Greensmith
et al. (2004) and Peters & Schaal (2008). It has since been argued that the value baseline
primarily reduces the aggressiveness of the updates rather than their variance (Mei et al., 2022).
Policy gradients have been adapted to produce deterministic policies (Silver et al., 2014; Lillicrap
et al., 2016; Fujimoto et al., 2018). The most direct approach is to maximize over the possible
actions, but if the action space is continuous, this requires an optimization procedure at each
step. The deep deterministic policy gradient algorithm (Lillicrap et al., 2016) moves the policy
in the direction of the gradient of the action value (implying the use of an actor-critic method).
Modern policy gradients: We introduced policy gradients in terms of the parameter update.
However, they can also be viewed as optimizing a surrogate loss based on importance sampling
of the expected rewards, using trajectories from the current policy parameters. This view allows
us to take multiple optimization steps validly. However, this can cause very large policy updates.
Overstepping is a minor problem in supervised learning, as the trajectory can be corrected later.
However, in RL, it aects future data collection and can be extremely destructive.
Several methods have been proposed to moderate these updates. Natural policy gradients
(Kakade, 2001) are based on natural gradients (Amari, 1998), which modify the descent direction by the Fisher information matrix. This provides a better update which is less likely to
get stuck in local plateaus. However, the Fisher matrix is impractical to compute in models
with many parameters. In trust-region policy optimization or TRPO (Schulman et al., 2015),
the surrogate objective is maximized subject to a constraint on the KL divergence between the
old and new policies. Schulman et al. (2017) propose a simpler formulation in which this KL
divergence appears as a regularization term. The regularization weight is adapted based on
the distance between the KL divergence and a target indicating how much we want the policy
to change. Proximal policy optimization or PPO (Schulman et al., 2017) is an even simpler
approach in which the loss is clipped to ensure smaller updates.
Actor-critic: In the actor-critic algorithm (Konda & Tsitsiklis, 1999) described in section 19.6,
the critic used a 1-step estimator. It’s also possible to use k-step estimators (in which we
Draft: please send errata to udlbookmail@gmail.com.
398 19 Reinforcement learning
observe k discounted rewards and approximate subsequent rewards with an estimate of the
state value). As k increases, the variance of the estimate increases, but the bias decreases.
Generalized advantage estimation (Schulman et al., 2016) weights together estimates from many
steps and parameterizes the weighting by a single term that trades off the bias and the variance.
Mnih et al. (2016) introduced asynchronous actor-critic or A3C in which multiple agents are
run independently in parallel environments and update the same parameters. Both the policy
and value function are updated every T time steps using a mix of k-step returns. Wang et al.
(2017) introduced several methods designed to make asynchronous actor-critic more efficient.
Soft actor-critic (Haarnoja et al., 2018b) adds an entropy term to the cost function, which encourages exploration and reduces overfitting as the policy is encouraged to be less confident.
Oine RL: In oine reinforcement learning, the policy is learned by observing the behavior
of other agents, including the rewards they receive, without the ability to change the policy. It
is related to imitation learning, where the goal is to copy the behavior of another agent without
access to rewards (see Hussein et al., 2017). One approach is to treat oine RL in the same
way as o-policy reinforcement learning. However, in practice, the distributional shift between
the observed and applied policy manifests in overly optimistic estimates of the action value
and poor performance (see Fujimoto et al., 2019; Kumar et al., 2019a; Agarwal et al., 2020).
Conservative Q-learning (Kumar et al., 2020b) learns conservative, lower-bound estimates of
the value function by regularizing the Q-values. The decision transformer (Chen et al., 2021c)
is a simple approach to oine learning that takes advantage of the well-studied self-attention
architecture. It can subsequently be ne-tuned with online training (Zheng et al., 2022).
Reinforcement learning and chatbots: Chatbots can be trained using a technique known
as reinforcement learning with human feedback or RLHF (Christiano et al., 2018; Stiennon et al.,
2020). For example, InstructGPT (the forerunner of ChatGPT, Ouyang et al., 2022) starts with
a standard transformer decoder model. This is then fine-tuned based on prompt-response pairs where the response was written by human annotators. During this training step, the model is optimized to predict the next word in the ground truth response.
Unfortunately, such training data are expensive to produce in sufficient quantities to support
high-quality performance. To resolve this problem, human annotators then indicate which of
several model responses they prefer. These (much cheaper) data are used to train a reward
model. This is a second transformer network that ingests the prompt and model response and
returns a scalar indicating how good the response is. Finally, the fine-tuned chatbot model is
further trained to produce high rewards using the reward model as supervision. Here, standard
gradient descent cannot be used as it’s not possible to compute derivatives through the sampling
procedure in the chatbot output. Hence, the model is trained with proximal policy optimization
(a policy gradient method where the derivatives are tractable) to generate higher rewards.
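The reward-model stage of this pipeline can be sketched as follows. The pairwise ranking loss $-\log \sigma(r(x_w) - r(x_l))$ used here is a common choice for learning from preference data, but the text does not specify the exact loss, and the toy linear scorer and synthetic preferences below are stand-ins for the transformer reward model and human annotations:

```python
import numpy as np

# Sketch of reward-model training from pairwise preferences (the second
# RLHF stage). The linear scorer, synthetic data, and ranking loss are
# illustrative assumptions, not the InstructGPT implementation.
rng = np.random.default_rng(0)
D = 16                         # embedding size of a (prompt, response) pair
w = np.zeros(D)                # toy linear reward model: r(x) = w @ x

def reward(x):
    return w @ x

# Synthetic preferences: annotators prefer responses aligned with a hidden direction.
hidden = rng.normal(size=D)
pairs = []
for _ in range(500):
    xa, xb = rng.normal(size=D), rng.normal(size=D)
    pairs.append((xa, xb) if hidden @ xa >= hidden @ xb else (xb, xa))

lr = 0.1
for epoch in range(20):
    for x_w, x_l in pairs:     # (preferred, rejected)
        # L = -log sigmoid(r(x_w) - r(x_l)); dL/dmargin = -1 / (1 + exp(margin))
        margin = reward(x_w) - reward(x_l)
        grad_coeff = -1.0 / (1.0 + np.exp(margin))
        w -= lr * grad_coeff * (x_w - x_l)

# The learned reward model should rank preferred responses above rejected ones.
accuracy = np.mean([reward(x_w) > reward(x_l) for x_w, x_l in pairs])
print(accuracy)
```

The scalar rewards from such a model then supervise the policy-gradient fine-tuning stage described above.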
Other areas of RL: Reinforcement learning is an enormous area, which easily justifies its own book, and this literature review is extremely superficial. Other notable areas of RL that we have not discussed include model-based RL, in which the state transition probabilities and reward functions are modeled (see Moerland et al., 2023). This allows forward planning and has the advantage that the same model can be reused for different reward structures. Hybrid methods such as AlphaGo (Silver et al., 2016) and MuZero (Schrittwieser et al., 2020) have separate models for the dynamics of the states, the policy, and the value of future positions.
This chapter has only discussed simple methods for exploration, like the epsilon-greedy approach, noisy Q-learning, and adding an entropy term to penalize overconfident policies. Intrinsic motivation refers to methods that add rewards for exploration and thus imbue the agent with “curiosity” (see Barto, 2013; Aubret et al., 2019). Hierarchical reinforcement learning (see Pateria et al., 2021) refers to methods that break down the final objective into sub-tasks. Multi-agent reinforcement learning (see Zhang et al., 2021a) considers the case where multiple agents coexist in a shared environment. This may be in either a competitive or cooperative context.
Problems
Problem 19.1 Figure 19.18 shows a single trajectory through the example MDP. Calculate the
return for each step in the trajectory given that the discount factor γ is 0.9.
Problem 19.2 Prove the policy improvement theorem. Consider changing from policy $\pi$ to policy $\pi'$, where for state $s_t$ the new policy $\pi'$ chooses the action that maximizes the expected return:

$$\pi'[a_t|s_t] = \underset{a_t}{\operatorname{argmax}}\left[r[s_t, a_t] + \gamma \cdot \sum_{s_{t+1}} Pr(s_{t+1}|s_t, a_t)\, v[s_{t+1}|\pi]\right], \tag{19.43}$$

and for all other states, the policies are the same. Show that the value $v[s_t|\pi]$ for the original policy must be less than or equal to $v[s_t|\pi'] = q\bigl[s_t, \pi'[a|s_t]\bigr]$ for the new policy:

$$v[s_t|\pi] \le q\Bigl[s_t, \pi'[a_t|s_t]\Bigr] = \mathbb{E}_{\pi'}\Bigl[r_{t+1} + \gamma \cdot v[s_{t+1}|\pi]\Bigr]. \tag{19.44}$$

Hint: Start by writing the term $v[s_{t+1}|\pi]$ in terms of the new policy.
Problem 19.3 Show that when the state values and policy are initialized as in figure 19.10a, they become those in figure 19.10b after two iterations of (i) policy evaluation (in which all states are updated based on their current values and then replace the previous ones) and (ii) policy improvement. The state transition allots half the probability to the direction the policy indicates and divides the remaining probability equally between the other valid actions. The reward function returns -2 irrespective of the action when the penguin leaves a hole. The reward function returns +3 regardless of the action when the penguin leaves the fish tile and the episode ends, so the fish tile has a value of +3.
Problem 19.4 The Boltzmann policy strikes a balance between exploration and exploitation by basing the action probabilities $\pi[a|s]$ on the current state-action reward function $q[s,a]$:

$$\pi[a|s] = \frac{\exp\bigl[q[s,a]/\tau\bigr]}{\sum_{a'}\exp\bigl[q[s,a']/\tau\bigr]}. \tag{19.45}$$

Explain how the temperature parameter $\tau$ can be varied to prioritize exploration or exploitation.
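The effect of the temperature can be checked numerically; the helper below is our own sketch of the temperature-scaled softmax, not code from the book:

```python
import numpy as np

# Temperature-scaled softmax sketch of the Boltzmann policy (eq. 19.45).
def boltzmann(q, tau):
    z = np.exp((q - q.max()) / tau)   # subtract the max for numerical stability
    return z / z.sum()

q = np.array([1.0, 2.0, 3.0])
p_cold = boltzmann(q, tau=0.1)    # low temperature: near-greedy (exploitation)
p_hot = boltzmann(q, tau=100.0)   # high temperature: near-uniform (exploration)
print(p_cold, p_hot)
```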
Problem 19.5 When the learning rate $\alpha$ is one, the Q-Learning update is given by:

$$f\bigl[q[s,a]\bigr] = r[s,a] + \gamma \cdot \max_{a'} q[s', a']. \tag{19.46}$$

Show that this is a contraction mapping (equation 16.30), so that:

$$\Bigl\|f\bigl[q_1[s,a]\bigr] - f\bigl[q_2[s,a]\bigr]\Bigr\|_\infty < \bigl\|q_1[s,a] - q_2[s,a]\bigr\|_\infty \quad \forall\, q_1, q_2, \tag{19.47}$$

where $\|\cdot\|_\infty$ represents the $\ell_\infty$ norm (Appendix B.3.2, Vector norms). It follows that a fixed point will exist by Banach's theorem and that the updates will eventually converge.
Figure 19.18 One trajectory through an MDP. The penguin receives a reward of +1 when it reaches the first fish tile, -2 when it falls in the hole, and +1 for reaching the second fish tile. The discount factor $\gamma$ is 0.9.
Problem 19.6 Show that:

$$\mathbb{E}_\tau\left[\frac{\partial \log\bigl[Pr(\tau|\theta)\bigr]}{\partial \theta}\, b\right] = 0, \tag{19.48}$$

and so subtracting a baseline $b$ does not change the expected policy gradient update.
Problem 19.7 Suppose that we want to estimate a quantity $\mathbb{E}[a]$ from samples $a_1, a_2, \ldots, a_I$. Consider that we also have paired samples $b_1, b_2, \ldots, b_I$ that co-vary with $a$, where $\mathbb{E}[b] = \mu_b$. We define a new variable:

$$a' = a - c(b - \mu_b). \tag{19.49}$$

Show that $\mathrm{Var}[a'] \le \mathrm{Var}[a]$ when the constant $c$ is chosen judiciously. Find an expression for the optimal value of $c$.
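This control-variate construction is easy to check numerically. The distributions and the particular (non-optimal) value of $c$ below are illustrative assumptions:

```python
import numpy as np

# Numerical check of equation 19.49: a' = a - c(b - mu_b) has the same
# expectation as a but lower variance for a judicious c. The data and
# the chosen c are illustrative.
rng = np.random.default_rng(1)
n = 100_000
b = rng.normal(0.0, 1.0, size=n)                 # paired samples, mu_b = 0
a = 2.0 * b + rng.normal(0.0, 1.0, size=n)       # a co-varies with b

c = 1.5                                          # one judicious (not optimal) choice
a_prime = a - c * (b - 0.0)                      # eq. 19.49 with mu_b = 0

print(a.var(), a_prime.var())                    # variance drops substantially
print(a.mean(), a_prime.mean())                  # both still estimate E[a] = 0
```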
Problem 19.8 The estimate of the gradient in equation 19.34 can be written as:

$$\mathbb{E}_\tau\Bigl[g[\theta]\bigl(r[\tau_t] - b\bigr)\Bigr], \tag{19.50}$$

where

$$g[\theta] = \sum_{t=1}^{T} \frac{\partial \log\bigl[Pr(a_t|s_t, \theta)\bigr]}{\partial \theta}, \tag{19.51}$$

and

$$r[\tau_t] = \sum_{k=t}^{T} r_k. \tag{19.52}$$

Show that the value of $b$ that minimizes the variance of the gradient estimate is given by:

$$b = \frac{\mathbb{E}\bigl[g[\tau]^2\, r[\tau]\bigr]}{\mathbb{E}\bigl[g[\tau]^2\bigr]}. \tag{19.53}$$
Chapter 20
Why does deep learning work?
This chapter diers from those that precede it. Instead of presenting established results,
it poses questions about how and why deep learning works so well. These questions are
rarely discussed in textbooks. However, it’s important to realize that (despite the title
of this book) understanding of deep learning is still limited.
We argue that it is surprising that deep networks are easy to train and also surprising
that they generalize. Then we consider each of these topics in turn. We enumerate the
factors that inuence training success and discuss what is known about loss functions for
deep networks. Then we consider the factors that inuence generalization. We conclude
with a discussion of whether networks need to be overparameterized and deep.
20.1 The case against deep learning
The MNIST-1D dataset (gure 8.1) has just forty input dimensions and ten output
dimensions. With enough hidden units per layer, a two-layer fully connected network
classies 10000 MNIST-1D training data points perfectly and generalizes reasonably to
unseen examples (gure 8.10a). Indeed, we now take it for granted that with sucient
hidden units, deep networks will classify almost any training set near-perfectly. We also
take for granted that the tted model will generalize to new data. However, it’s not at
all obvious either that the training process should succeed or that the resulting model
should generalize. This section argues that both these phenomena are surprising.
20.1.1 Training
Performance of a two-layer fully connected network on 10,000 MNIST-1D training examples is perfect once there are 43 hidden units per layer (4000 parameters). However, finding the global minimum of an arbitrary non-convex function is NP-hard (Murty & Kabadi, 1987), and this is also true for certain neural network loss functions (Blum & Rivest, 1992). It's remarkable that the fitting algorithm doesn't get trapped in local minima or stuck near saddle points and that it can efficiently recruit spare model capacity
to t unexplained training data wherever they lie.
Perhaps this success is less surprising when there are far more parameters than train-
ing data. However, it’s debatable whether this is generally the case. AlexNet had 60
million parameters and was trained with 1 million data points. However, to complicate
matters, each training example was augmented with 2048 transformations. GPT-3 had
175 billion parameters and was trained with 300 billion tokens. There is not a clear-cut
case that either model was overparameterized, and yet they were successfully trained.
In short, it's surprising that we can fit deep networks reliably and efficiently. Either
the data, the models, the training algorithms, or some combination of all three must
have some special properties that make this possible.
20.1.2 Generalization
If the ecient tting of neural networks is startling, their generalization to new data
is dumbfounding. First, it’s not obvious a priori that typical datasets are sucient to
characterize the input/output mapping. The curse of dimensionality implies that the
training dataset is tiny compared to the possible inputs; if each of the 40 inputs of the
MNIST-1D data were quantized into 10 possible values, there would be 10
40
possible
inputs, which is a factor of 10
35
more than the number of training examples.
Problem 20.1
Second, deep networks describe very complicated functions. A fully connected network for MNIST-1D with two hidden layers of width 400 can create mappings with up to $10^{42}$ linear regions. That's roughly $10^{37}$ regions per training example, so very few of these regions contain data at any stage during training; regardless, those regions that do encounter data points constrain the remaining regions to behave reasonably.
Third, generalization gets better with more parameters (figure 8.10). The model in the previous paragraph has 177,201 parameters. Assuming it can fit one training example per parameter, it has 167,201 spare degrees of freedom. This surfeit gives the model latitude to do almost anything between the training data, and yet it behaves sensibly.
20.1.3 The unreasonable eectiveness of deep learning
To summarize, it's neither obvious that we should be able to fit deep networks nor that they should generalize. A priori, deep learning shouldn't work. And yet it does. This chapter investigates why. Sections 20.2–20.3 describe what we know about fitting deep networks and their loss functions. Sections 20.4–20.6 examine generalization.
20.2 Factors that inuence tting performance
Figure 6.4 showed that loss functions for nonlinear models can have both local minima and saddle points. However, we can reliably fit deep networks to complex training sets. For example, figure 8.10 shows perfect training performance on MNIST-1D, MNIST, and CIFAR-100. This section considers factors that might resolve this contradiction.
Figure 20.1 Fitting random data. Losses for AlexNet architecture trained on CIFAR-10 dataset with SGD. When the pixels are drawn from a Gaussian random distribution with the same mean and variance as the original data, the model can still be fit (albeit more slowly). When the labels are randomized, the model can still be fit (albeit even more slowly). Adapted from Zhang et al. (2017a).
20.2.1 Dataset
It's important to realize that we can't learn any function. Consider a completely random mapping from every possible 28×28 binary image to one of ten categories. Since there is no structure to this function, the only recourse is to memorize the $2^{784}$ assignments. However, it's easy to train a model on the MNIST dataset (figures 8.10 and 15.15), which contains 60,000 examples of 28×28 images labeled with one of ten categories. One explanation for this contradiction could be that it is easy to find global minima because the real-world functions that we approximate are relatively simple.¹
This hypothesis was investigated by Zhang et al. (2017a), who trained AlexNet on the
Notebook 20.1
Random data
CIFAR-10 image classification dataset when (i) each image was replaced with Gaussian noise and (ii) the labels of the ten classes were randomly permuted (figure 20.1). These changes slowed down learning, but the network could still fit this finite dataset well.
Problem 20.2
This suggests that the properties of the dataset aren't critical.
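The two corruptions can be sketched as follows. The array shapes mimic CIFAR-10, but the names and synthetic data are illustrative stand-ins, not the book's notebook code:

```python
import numpy as np

# Sketch of the two dataset corruptions studied by Zhang et al. (2017a).
# Shapes mimic CIFAR-10; the data here are synthetic placeholders.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(1000, 32, 32, 3)).astype(np.float32)
labels = rng.integers(0, 10, size=1000)

# (i) Replace each image with Gaussian noise matching the data's mean/variance.
noise_images = rng.normal(images.mean(), images.std(), size=images.shape)

# (ii) Randomly permute the labels, destroying any image-label structure.
random_labels = rng.permutation(labels)

print(noise_images.shape, random_labels[:10])
```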
20.2.2 Regularization
Another possible explanation for the ease with which models are trained is that some
regularization methods like L2 regularization (weight decay) make the loss surface flatter
and more convex. However, Zhang et al. (2017a) found that neither L2 regularization nor
Dropout was required to t random data. This does not eliminate implicit regularization
due to the nite step size of the tting algorithms (section 9.2). However, this eect
increases with the learning rate (equation 9.9), and model-tting does not get easier
with larger learning rates.
20.2.3 Stochastic training algorithms
Chapter 6 argued that the SGD algorithm potentially allows the optimization trajectory
to move between “valleys” during training. However, Keskar et al. (2017) show that
several models (including fully connected and convolutional networks) can be fit to many
¹ In this chapter, we use the term “global minimum” loosely to mean any solution where all data are classified correctly. We have no way of knowing if there are solutions with a lower loss elsewhere.
Figure 20.2 MNIST-1D training. Four fully connected networks were fit to 4000 MNIST-1D examples with random labels using full batch gradient descent, He initialization, no momentum or regularization, and learning rate 0.0025. Models with 1, 2, 3, 4 layers had 298, 100, 75, and 63 hidden units per layer and 15208, 15210, 15235, and 15139 parameters, respectively. All models train successfully, but deeper models require fewer epochs.
datasets (including CIFAR-100 and MNIST) almost perfectly with very large batches of
5000-6000 images. This eliminates most of the randomness but training still succeeds.
Figure 20.2 shows training results for four fully connected models fitted to 4000
Notebook 20.2
Full batch
gradient descent
MNIST-1D examples with randomized labels using full-batch (i.e., non-stochastic) gra-
dient descent. There was no explicit regularization, and the learning rate was set to a
small constant value of 0.0025 to minimize implicit regularization. Here, the true mapping
Problem 20.3
from data to labels has no structure, the training is deterministic, and there is no
regularization, and yet the training error still decreases to zero. This suggests that these
loss functions may genuinely have no local minima.
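A miniature version of this experiment can be run with a hand-rolled network. The synthetic 40-dimensional inputs stand in for MNIST-1D, and the sizes, iteration count, and learning rate are our own choices rather than the settings of figure 20.2:

```python
import numpy as np

# Full-batch (non-stochastic) gradient descent on random labels: a small
# two-layer ReLU network memorizes structureless data. Sizes and learning
# rate are illustrative; inputs are synthetic stand-ins for MNIST-1D.
rng = np.random.default_rng(0)
I, D, H, C = 50, 40, 100, 10              # examples, input dim, hidden, classes
X = rng.normal(size=(I, D))
y = rng.integers(0, C, size=I)            # random labels: nothing to generalize

W1 = rng.normal(size=(D, H)) * np.sqrt(2 / D)   # He initialization
W2 = rng.normal(size=(H, C)) * np.sqrt(2 / H)
lr = 0.02

for it in range(15_000):
    h = np.maximum(X @ W1, 0)             # forward pass (ReLU hidden layer)
    logits = h @ W2
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    g = p.copy()                          # gradient of mean cross-entropy
    g[np.arange(I), y] -= 1.0
    g /= I
    dW2 = h.T @ g                         # full-batch backward pass
    dh = (g @ W2.T) * (h > 0)
    dW1 = X.T @ dh
    W1 -= lr * dW1
    W2 -= lr * dW2

h = np.maximum(X @ W1, 0)
train_error = (np.argmax(h @ W2, axis=1) != y).mean()
print(train_error)  # falls toward zero despite random labels and no stochasticity
```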
20.2.4 Overparameterization
Overparameterization almost certainly is an important factor that contributes to ease
of training. It implies that there is a large family of degenerate solutions, so there may
always be a direction in which the parameters can be modified to decrease the loss.
Sejnowski (2020) suggests that “. . . the degeneracy of solutions changes the nature of the problem from finding a needle in a haystack to a haystack of needles.”
In practice, networks are frequently overparameterized by one or two orders of magnitude (figure 20.3). However, data augmentation makes it difficult to make precise statements. Augmentation may increase the data by several orders of magnitude, but these are manipulations of existing examples rather than independent new data points. Moreover, figure 8.10 shows that neural networks can sometimes fit the training data well when there are the same number or fewer parameters than data points. This is presumably due to redundancy in training examples from the same underlying function.
Several theoretical convergence results show that, under certain circumstances, SGD converges to a global minimum when the network is sufficiently overparameterized. For example, Du et al. (2019b) show that randomly initialized SGD converges to a global minimum for shallow fully connected ReLU networks with a least squares loss with enough hidden units. Similarly, Du et al. (2019a) consider deep, residual, and convolutional networks when the activation function is smooth and Lipschitz. Zou et al. (2020) analyzed the convergence of gradient descent on deep, fully connected networks using a hinge loss. Allen-Zhu et al. (2019) considered deep networks with ReLU functions.
If a neural network is suciently overparameterized so that it can memorize any
This work is subject to a Creative Commons CC-BY-NC-ND license. (C) MIT Press.
20.2 Factors that inuence tting performance 405
Figure 20.3 Overparameterization. Im-
ageNet performance for convolutional
nets as a function of overparameteriza-
tion (in multiples of dataset size). Most
models have 10–100 times more param-
eters than there were training exam-
ples. Models compared are ResNet (He
et al., 2016a,b), DenseNet (Huang et al.,
2017b), Xception (Chollet, 2017), E-
cientNet (Tan & Le, 2019), Inception
(Szegedy et al., 2017), ResNeXt (Xie
et al., 2017), and AmoebaNet (Cubuk
et al., 2019).
dataset of a xed size, then all stationary points become global minima (Livni et al.,
2014; Nguyen & Hein, 2017, 2018). Other results show that if the network is wide
enough, local minima where the loss is higher than the global minimum are rare (see
Choromanska et al., 2015; Pascanu et al., 2014; Pennington & Bahri, 2017). Kawaguchi
et al. (2019) prove that as a network becomes deeper, wider, or both, the loss at local
minima becomes closer to that at the global minimum for squared loss functions.
These theoretical results are intriguing but usually make unrealistic assumptions about the network structure. For example, Du et al. (2019a) show that residual networks converge to zero training loss when the width of the network D (i.e., the number of hidden units) is Ω[I⁴K²], where I is the amount of training data, and K is the depth of the network. Similarly, Nguyen & Hein (2017) assume that the network's width is larger than the dataset size, which is unrealistic in most practical scenarios. Overparameterization seems to be important, but theory cannot yet explain empirical fitting performance.
20.2.5 Activation functions
The activation function is also known to affect training difficulty. Networks where the activation only changes over a small part of the input range are harder to fit than ReLUs (which vary over half the input range) or Leaky ReLUs (which vary over the full range). For example, sigmoid and tanh nonlinearities (figure 3.13a) have shallow gradients in their tails; where the activation function is near-constant, the training gradient is near-zero, so there is no mechanism to improve the model.
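This saturation effect is easy to see numerically. The sketch below (toy values, not from the book) prints the derivative of each activation at a few points; the sigmoid's gradient collapses in both tails, while ReLU and Leaky ReLU keep a nonzero gradient over half or all of the input range:

```python
import numpy as np

# Derivative magnitudes of different activation functions. Where the
# activation is near-constant, the gradient it passes back is near-zero,
# and learning stalls for the units in that regime.
z = np.array([-10.0, -5.0, 0.0, 5.0, 10.0])

s = 1 / (1 + np.exp(-z))                  # sigmoid
sigmoid_grad = s * (1 - s)                # tiny in both tails
relu_grad = (z > 0).astype(float)         # exactly 1 over half the input range
leaky_grad = np.where(z > 0, 1.0, 0.1)    # nonzero over the full range

for zi, sg, rg, lg in zip(z, sigmoid_grad, relu_grad, leaky_grad):
    print(f"z={zi:+5.1f}  sigmoid'={sg:.1e}  relu'={rg:.0f}  leaky'={lg:.1f}")
```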
20.2.6 Initialization
Another potential explanation is that Xavier/He initialization sets the parameters to
values that are easy to optimize. Of course, for deeper networks, such initialization is
necessary to avoid exploding and vanishing gradients, so in a trivial sense, initialization
is critical to training success. However, for shallower networks, the initial variance of the
weights is less important.

Draft: please send errata to udlbookmail@gmail.com.

Figure 20.4 Initialization and fitting. A three-layer fully connected network with 200 hidden units per layer was trained on 1000 MNIST examples with AdamW using one-hot targets and mean-squared error loss. It takes longer to fit networks when larger multiples of He initialization are used, but this doesn't change the outcome. This may simply reflect the extra distance that the weights must move. Adapted from Liu et al. (2023c).

Liu et al. (2023c) trained a 3-layer fully connected network with 200 hidden units per layer on 1000 MNIST data points. They found that more iterations
were required to t the training data as the variance increased from that proposed by He
(gure 20.4), but this did not ultimately impede tting. Hence, initialization doesn’t shed
much light on why tting neural networks is easy, although exploding/vanishing gradients
do reveal initializations that make training dicult with nite precision arithmetic.
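The effect of scaling He initialization can be sketched directly by propagating a random input through a deep ReLU network (toy sizes, chosen for illustration). At exactly He variance (2/width), the activation magnitude stays roughly constant with depth; multiples of it make the signal explode or vanish:

```python
import numpy as np

rng = np.random.default_rng(1)

# Signal propagation through a deep ReLU network as a function of the
# initialization scale (in multiples of He initialization).
def forward_std(scale, depth=20, width=200):
    h = rng.normal(size=(100, width))            # a batch of random inputs
    for _ in range(depth):
        W = scale * rng.normal(size=(width, width)) * np.sqrt(2 / width)
        h = np.maximum(h @ W, 0)                 # ReLU layer
    return h.std()

print(f"1.0 x He: activation std {forward_std(1.0):.2e}")   # stays O(1)
print(f"3.0 x He: activation std {forward_std(3.0):.2e}")   # explodes with depth
print(f"0.3 x He: activation std {forward_std(0.3):.2e}")   # vanishes with depth
```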
20.2.7 Network depth
Neural networks are harder to t when the depth becomes very large due to exploding
and vanishing gradients (gure 7.7) and shattered gradients (gure 11.3). However,
these are (arguably) practical numerical issues. There is no denitive evidence that
the underlying loss function is fundamentally more or less convex as the network depth
increases. Figure
20.2 does show that for MNIST data with randomized labels and He
initialization, deeper networks train in fewer iterations. However, this might be because
either (i) the gradients in deeper networks are steeper or (ii) He initialization just starts
wider, shallower networks further away from the optimal parameters.
Frankle & Carbin (2019) show that for small networks like VGG, you can get the same or better performance if you (i) train the network, (ii) prune the weights with the smallest magnitudes, and (iii) retrain from the same initial weights. This does not work if the weights are randomly re-initialized. They concluded that the original overparameterized network contains small trainable sub-networks, which are sufficient to provide the performance. They term this the lottery ticket hypothesis and denote the sub-networks as winning tickets. This suggests that the effective number of sub-networks may have a key role to play in fitting. This (perhaps) varies with the network depth for a fixed parameter count, but a precise characterization of this idea is lacking.
Notebook 20.3
Lottery tickets
20.3 Properties of loss functions
The previous section discussed factors that contribute to the ease with which neural networks can be trained. The number of parameters (degree of overparameterization) and the choice of activation function are both important. Surprisingly, the choice of dataset, the randomness of the fitting algorithm, and the use of regularization don't seem important. There is no definitive evidence that (for a fixed parameter count) the depth of the network matters (other than numerical problems due to exploding/vanishing/shattered gradients). This section tackles the same topic from a different angle by considering the empirical properties of loss functions. Most of this evidence comes from fully connected networks and CNNs; loss functions of transformer networks are less well understood.
20.3.1 Multiple global minima
We expect loss functions for deep networks to have a large family of equivalent global
minima. In fully connected networks, the hidden units at each layer and their associated
weights can be permuted without changing the output. In convolutional networks, per-
muting the channels and convolution kernels appropriately doesn’t change the output.
We can multiply the weights feeding into any ReLU function by a positive number and divide the weights leaving it by the same number without changing the output. Using BatchNorm induces another set of redundancies because the mean and variance of each hidden unit or channel are reset.
The above modifications all produce the same output for every input. However, the global minimum only depends on the output at the training data points. In overparameterized networks, there will also be families of solutions that behave identically at the data points but differently between them. All of these are also global minima.
20.3.2 Route to the minimum
Goodfellow et al. (2015b) considered a straight line between the initial parameters and the final values. They show that the loss function along this line usually decreases monotonically (except sometimes for a small bump near the start). This phenomenon is observed for several different types of networks and activation functions (figure 20.5a).
Of course, real optimization trajectories do not proceed in a straight line. However, Li et al. (2018b) find that they do lie in low-dimensional subspaces. They attribute this to the existence of large, nearly convex regions in the loss landscape that capture the trajectory early on and funnel it in a few important directions. Surprisingly, Li et al. (2018a) showed that networks still train well if optimization is constrained to lie in a random low-dimensional subspace (figure 20.6).
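Subspace training is easy to sketch: only a low-dimensional vector z is optimized, and the full parameters are reconstructed as θ = θ₀ + Pz for a fixed random projection P. The toy problem below (a random quadratic loss, not a neural network) illustrates the mechanics in the spirit of Li et al. (2018a); all the sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)

D_full, D_sub = 50, 5
theta0 = rng.normal(size=D_full)                       # initialization
P = rng.normal(size=(D_full, D_sub)) / np.sqrt(D_full) # fixed random subspace

A = rng.normal(size=(D_full, D_full))
target = rng.normal(size=D_full)
loss = lambda th: np.mean((A @ th - target) ** 2)      # toy quadratic loss

z = np.zeros(D_sub)                                    # only z is trained
for _ in range(500):
    theta = theta0 + P @ z                             # reconstruct full params
    grad_theta = 2 * A.T @ (A @ theta - target) / D_full
    z -= 0.01 * (P.T @ grad_theta)                     # chain rule: grad w.r.t. z

print(f"loss at theta0: {loss(theta0):.3f}, "
      f"after subspace training: {loss(theta0 + P @ z):.3f}")
```

The loss decreases even though only 5 of the 50 parameters' worth of freedom is ever used; in the real experiments, a subspace of a fraction of a percent of the parameters suffices.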
Li & Liang (2018) show that the relative change in the parameters during training
decreases as network width increases; for larger widths, the parameters start at smaller
values, change by a smaller proportion of those values, and converge in fewer steps.
20.3.3 Connections between minima
Goodfellow et al. (2015b) examined the loss function along a straight line between two minima that were found independently. They saw a pronounced increase in the loss between them (figure 20.5b); good minima are not generally linearly connected. However,
Figure 20.5 Linear slices through loss function. a) A two-layer fully connected ReLU network is trained on MNIST. The loss along a straight line starting at the initial parameters (δ=0) and finishing at the trained parameters (δ=1) descends monotonically. b) However, in this two-layer fully connected MaxOut network on MNIST, there is an increase in the loss along a straight line between one solution (δ=0) and another (δ=1). Adapted from Goodfellow et al. (2015b).
Figure 20.6 Subspace training. A fully connected network with two hidden layers, each with 200 units, was trained on MNIST. Parameters were initialized using a standard method but then constrained to lie within a random subspace. Performance reaches 90% of the unconstrained level when this subspace is 750D (termed the intrinsic dimension), which is 0.4% of the original parameters. Adapted from Li et al. (2018a).
Frankle et al. (2020) showed that this increase vanishes if the networks are identically trained initially and later allowed to diverge by using different SGD noise and augmentation. This suggests that the solution is constrained early in training and that some families of minima are linearly connected.
Draxler et al. (2018) found minima with good (but different) performance on the CIFAR-10 dataset. They then showed that it is possible to construct paths from one to the other, where the loss function remains low along this path. They conclude that there is a single connected manifold of low loss (figure 20.7). This seems to be increasingly true as the width and depth of the network increase. Garipov et al. (2018) and Fort & Jastrzębski (2019) present other schemes for connecting minima.
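The basic measurement behind these studies is simply the loss evaluated along the line θ(δ) = (1−δ)θₐ + δθᵦ between two parameter vectors. The toy sketch below (a hand-built loss with two separate minima, standing in for a real network's loss) shows the barrier that linear interpolation can reveal:

```python
import numpy as np

# Loss along a straight line between two parameter vectors, as in
# Goodfellow et al. (2015b). This toy loss has minima where every entry
# is +1 or -1, so two different minima are separated by a barrier.
loss = lambda th: np.sum((th ** 2 - 1) ** 2)

theta_a = np.array([1.0, 1.0])    # one minimum (loss 0)
theta_b = np.array([-1.0, 1.0])   # a different minimum (loss 0)

for delta in np.linspace(0, 1, 5):
    theta = (1 - delta) * theta_a + delta * theta_b
    print(f"delta={delta:.2f}  loss={loss(theta):.2f}")
```

Both endpoints have zero loss, but the midpoint does not: the two minima are not linearly connected, although (as in figure 20.7) a curved low-loss path between them may still exist.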
20.3.4 Curvature of loss surface
Random Gaussian functions (in which points are jointly distributed with covariance
given by a kernel function of their distance) have an interesting property: for points
Figure 20.7 Connections between minima. A slice through the loss function of DenseNet on CIFAR-10. Parameters ϕ₁ and ϕ₂ are two independently discovered minima. Linear interpolation between these parameters reveals an energy barrier (dashed line). However, for sufficiently deep and wide networks, it is possible to find a curved path of low energy between two minima (cyan line). Adapted from Draxler et al. (2018).
Figure 20.8 Critical points vs. loss. a) In random Gaussian functions, the number
of directions in which the function curves down at points with zero gradient
decreases with the height of the function, so minima all appear at lower function
values. b) Dauphin et al. (2014) found critical points on a neural network loss
surface (i.e., points with zero gradient). They showed that the proportion of
negative eigenvalues (directions that point down) decreases with the loss. The
implication is that all minima (points with zero gradient where no directions
point down) have low losses. Adapted from Dauphin et al. (2014) and Bahri
et al. (2020).
Figure 20.9 Goldilocks zone. The proportion of eigenvalues of the Hessian that are greater than zero (a measure of positive curvature/convexity) within a random subspace of dimension Dₛ in a two-layer fully connected network with ReLU functions applied to MNIST, as a function of the squared radius r² of the parameters relative to Xavier initialization. There is a pronounced region of positive curvature known as the Goldilocks zone. Adapted from Fort & Scherlis (2019).
Figure 20.10 Batch size to learning rate ratio. Generalization of two models on the CIFAR-10 database depends on the ratio of batch size to the learning rate. As the batch size increases, generalization decreases. As the learning rate increases, generalization increases. Adapted from He et al. (2019).
where the gradient is zero, the fraction of directions where the function curves down
becomes smaller when these points occur at lower loss values (see Bahri et al., 2020).
Dauphin et al. (2014) searched for saddle points in a neural network loss function and
similarly found a correlation between the loss and the number of negative eigenvalues
(figure 20.8). Baldi & Hornik (1989) analyzed the error surface of a shallow network and
found that there were no local minima but only saddle points. These results suggest that
there are few or no bad local minima.
Fort & Scherlis (2019) measured the curvature at random points on a neural network loss surface; they showed that the curvature of the surface is unusually positive when the ℓ₂ norm of the weights lies within a certain range (figure 20.9), which they term the Goldilocks zone. He and Xavier initialization fall within this range.
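The statistic plotted in figures 20.8–20.9 is the fraction of Hessian eigenvalues that are positive at a point. The sketch below estimates it numerically on a hand-built toy loss (not a network) whose curvature at the origin is known to be positive in three directions and negative in two:

```python
import numpy as np

# Estimate the Hessian by central finite differences and count the
# fraction of positive eigenvalues (directions of positive curvature).
def hessian(loss, theta, eps=1e-4):
    n = len(theta)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i, e_j = np.eye(n)[i] * eps, np.eye(n)[j] * eps
            H[i, j] = (loss(theta + e_i + e_j) - loss(theta + e_i - e_j)
                       - loss(theta - e_i + e_j) + loss(theta - e_i - e_j)
                       ) / (4 * eps ** 2)
    return H

# A saddle point at the origin: curves up in 3 directions, down in 2.
loss = lambda th: np.sum(th[:3] ** 2) - np.sum(th[3:] ** 2)

eigs = np.linalg.eigvalsh(hessian(loss, np.zeros(5)))
frac_positive = np.mean(eigs > 0)
print(f"fraction of positive-curvature directions: {frac_positive:.2f}")  # 0.60
```

Applying this measurement at points of varying loss (or varying weight norm) gives the curves shown in the figures.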
20.4 Factors that determine generalization
The last two sections considered factors that determine whether the network trains suc-
cessfully and what is known about neural network loss functions. This section considers
factors that determine how well the network generalizes. This complements the discus-
sion of regularization (chapter 9), which explicitly aims to encourage generalization.
20.4.1 Training algorithms
Since deep networks are usually overparameterized, the details of the training process
determine which of the degenerate family of minima the algorithm converges to. Some
of these details reliably improve generalization.
LeCun et al. (2012) show that SGD generalizes better than full-batch gradient descent. It has been argued that SGD generalizes better than Adam (e.g., Wilson et al., 2017; Keskar & Socher, 2017), but more recent studies suggest that there is little difference when the hyperparameter search is done carefully (Choi et al., 2019). Keskar et al. (2017) show that deep nets generalize better with smaller batch sizes when no other form of regularization is used. It is also well known that larger learning rates tend to generalize better (e.g., figure 9.5). Jastrzębski et al. (2018), Goyal et al. (2018), and He
Figure 20.11 Flat vs. sharp minima. Flat minima are expected to generalize better. Small errors in estimating the parameters or in the alignment of the train and test loss functions are less problematic in flat regions. Adapted from Keskar et al. (2017).
et al. (2019) argue that the batch size/learning rate ratio is important. He et al. (2019) show a significant correlation between this ratio and the degree of generalization and prove a generalization bound for neural networks, which has a positive correlation with this ratio (figure 20.10).
These observations are aligned with the discovery that SGD implicitly adds regularization terms to the loss function (section 9.2), whose magnitude depends on the learning rate. This regularization changes the trajectory of the parameters so that they converge to a part of the loss function that generalizes well.
20.4.2 Flatness of minimum
There has been speculation dating at least to Hochreiter & Schmidhuber (1997a) that flat minima in the loss function generalize better than sharp minima (figure 20.11). Informally, if the minimum is flatter, then small errors in the estimated parameters are less important. This can also be motivated from various theoretical viewpoints. For example, minimum description length theory suggests that models specified by fewer bits generalize better (Rissanen, 1983). For wide minima, the precision needed to store the weights is lower, so they should generalize better.
Flatness can be measured by (i) the size of the connected region around the minimum for which training loss is similar (Hochreiter & Schmidhuber, 1997a), (ii) the second-order curvature around the minimum (Chaudhari et al., 2019), or (iii) the maximum loss within a neighborhood of the minimum (Keskar et al., 2017). However, caution is required; estimated flatness can be affected by trivial reparameterizations of the network due to the non-negative homogeneity property of the ReLU function (Dinh et al., 2017).
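Measure (iii) can be sketched directly: sample random perturbations of a fixed radius around a minimum and record the worst increase in loss. The toy example below (two hand-built quadratic minima of different curvature, standing in for real loss surfaces) shows the sharper minimum scoring higher:

```python
import numpy as np

rng = np.random.default_rng(5)

# Sharpness as the maximum loss increase within a small neighborhood of a
# minimum, estimated by random perturbations on a sphere of fixed radius
# (in the spirit of Keskar et al., 2017).
def sharpness(loss, theta, radius=0.1, n_samples=1000):
    worst = -np.inf
    for _ in range(n_samples):
        d = rng.normal(size=theta.shape)
        d *= radius / np.linalg.norm(d)          # random point on the sphere
        worst = max(worst, loss(theta + d))
    return worst - loss(theta)

flat_loss = lambda th: 0.5 * np.sum(th ** 2)     # shallow curvature
sharp_loss = lambda th: 50.0 * np.sum(th ** 2)   # steep curvature

theta_min = np.zeros(10)
print(f"flat minimum sharpness:  {sharpness(flat_loss, theta_min):.4f}")
print(f"sharp minimum sharpness: {sharpness(sharp_loss, theta_min):.4f}")
```

Note that for a real ReLU network, the reparameterization caveat above applies: rescaling the weights changes this number without changing the function.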
Nonetheless, Keskar et al. (2017) varied the batch size and learning rate and showed that flatness correlates with generalization. Izmailov et al. (2018) average together weights from multiple points in a learning trajectory. This both results in flatter test and training surfaces at the minimum and improves generalization. Other regularization techniques can also be viewed through this lens. For example, averaging model outputs (ensembling) may also make the test loss surface flatter. Kleinberg et al. (2018) showed that large gradient variance during training helps avoid sharp regions. This may explain why reducing the batch size and adding noise helps generalization.
The above studies consider flatness for a single model and training set. However, sharpness is not a good criterion to predict generalization between datasets; when the labels in the CIFAR dataset are randomized (making generalization impossible), there is no commensurate decrease in the flatness of the minimum (Neyshabur et al., 2017).
20.4.3 Architecture
The inductive bias of a network is determined by its architecture, and judicious choices
of model can drastically improve generalization. Chapter 10 introduced convolutional
networks, which are designed to process data on regular grids; they implicitly assume
that the input statistics are the same across the input, so they share parameters across
position. Similarly, transformers are suited for modeling data that is invariant to permu-
tations, and graph neural networks are suited to data represented on irregular graphs.
Matching the architecture to the properties of the data improves generalization over
generic, fully connected architectures (see figure 10.8).
20.4.4 Norm of weights
Section 20.3.4 reviewed the nding of Fort & Scherlis (2019) that the curvature of the loss
surface is unusually positive when the
2
norm of the weights lies within a certain range.
The same authors provided evidence that generalization is also good when the
2
weight
norm falls within this Goldilocks zone (gure 20.12). This is perhaps unsurprising. The
norm of the weights is (indirectly) related to the Lipschitz constant of the model. If this
norm is too small, then the model will not be able to change fast enough to capture the
variation in the underlying function. If the norm is too large, then the model will be
unnecessarily variable between training points and will not interpolate smoothly.
This nding was used by Liu et al. (2023c) to explain the phenomenon of grokking
(Power et al., 2022), in which a sudden improvement in generalization can occur many
epochs after the training error is already zero (gure 20.13). It is proposed that grokking
occurs when the norm of the weights is initially too large; the training data ts well,
but the variation of the model between the data points is large. Over time, implicit or
explicit regularization decreases the norm of the weights until they reach the Goldilocks
zone, and generalization suddenly improves.
20.4.5 Overparameterization
Figure 8.10 showed that generalization performance tends to improve with the degree of overparameterization. When combined with the bias/variance trade-off curve, this results in double descent. The putative explanation for this improvement is that the network has more latitude to become smoother between the training data points when the model is overparameterized.
It follows that the norm of the weights can also be used to explain double descent. The norm of the weights increases when the number of parameters is similar to the number of data points (as the model contorts itself to fit these points exactly), causing
Figure 20.12 Generalization on hyperspheres. A fully connected network with two hidden layers, each with 200 units (198,450 parameters), was trained on the MNIST database. The parameters are initialized to a given ℓ₂ norm and then constrained to maintain this norm and to lie in a subspace (vertical direction). The network generalizes well in a small range around the radius r defined by Xavier initialization (cyan dotted line). Adapted from Fort & Scherlis (2019).
Figure 20.13 Grokking. When the parameters are initialized so that their ℓ₂ norm (radius) is considerably larger than is specified by He initialization, training takes longer (dashed lines), and generalization takes much longer (solid lines). The lag in generalization is attributed to the time taken for the norm of the weights to decrease back to the Goldilocks zone. Adapted from Liu et al. (2023c).
generalization to reduce. As the network becomes wider and the number of weights
increases, the overall norm of these weights decreases; the weights are initialized with a
variance that is inversely proportional to the width (i.e., with He or Glorot initialization),
and the weights change very little from their original values.
20.4.6 Leaving the data manifold
Until this point, we have discussed how models generalize to new data drawn from the same distribution as the training data. This is a reasonable assumption for experimentation. However, systems deployed in the real world may encounter unexpected data due to noise, changes in the data statistics over time, or deliberate attacks. Of course, it is harder to make definite statements about this scenario, but D'Amour et al. (2020) show that the variability of identical models trained with different seeds on corrupted data can be enormous and unpredictable.
Goodfellow et al. (2015a) showed that deep learning models are susceptible to adversarial attacks. Consider perturbing an image that is correctly classified by the network as "dog" so that the probability of the correct class decreases as fast as possible until the class flips. If this image is now classified as an airplane, you might expect the perturbed image to look like a cross between a dog and an airplane. However, in practice, the perturbed image looks almost indistinguishable from the original dog image (figure 20.14).
Notebook 20.4
Adversarial attacks
Figure 20.14 Adversarial examples. In each case, the left image is correctly classified by AlexNet. By considering the gradients of the network output with respect to the input, it's possible to find a small perturbation (center, magnified by 10 for visibility) that, when added to the original image (right), causes the network to misclassify it as an ostrich. This is despite the fact that the original and perturbed images are almost indistinguishable to humans. Adapted from Szegedy et al. (2014).
The conclusion is that there are positions that are close to but not on the data manifold that are misclassified. These are known as adversarial examples. Their existence is surprising; how can such a small change to the network input make such a drastic change to the output? The best current explanation is that adversarial examples aren't due to a lack of robustness to data from outside the training data manifold. Instead, they exploit a source of information that is present in the training distribution but which has a small norm and is imperceptible to humans (Ilyas et al., 2019).
20.5 Do we need so many parameters?
Section 20.4 argued that models generalize better when overparameterized. Indeed, there are almost no examples of state-of-the-art performance on complex datasets where the model has significantly fewer parameters than there were training data points. However, section 20.2 reviewed evidence that training becomes easier as the number of parameters increases. Hence, it's not clear whether some fundamental property of smaller models prevents them from performing as well or whether the training algorithms can't find good solutions for small models. Pruning and distilling are two methods for reducing the size of trained models. This section examines whether these methods can produce underparameterized models which retain the performance of overparameterized ones.
20.5.1 Pruning
Pruning trained models reduces their size and hence storage requirements (figure 20.15).
The simplest approach is to remove individual weights. This can be done based on the
second derivatives of the loss function (LeCun et al., 1990; Hassibi & Stork, 1993) or
(more practically) based on the absolute value of the weight (Han et al., 2016, 2015).
Other work prunes hidden units (Zhou et al., 2016a; Alvarez & Salzmann, 2016), channels
in convolutional networks (Li et al., 2017a; Luo et al., 2017b; He et al., 2017; Liu et al.,
2019a), or entire layers in residual nets (Huang & Wang, 2018). Often, the network is
fine-tuned after pruning, and sometimes this process is repeated.

Figure 20.15 Pruning neural networks. The goal is to remove as many weights as possible without decreasing performance. This is often done just based on the magnitude of the weights. Typically, the network is fine-tuned after pruning. a) Example fully connected network. b) After pruning.
For example, Han et al. (2016) maintained good performance for the VGG network on ImageNet classification when only 8% of the weights were retained. This significantly decreases the model size but isn't enough to show that overparameterization is not required; the VGG network has 100 times as many parameters as there are ImageNet training data (disregarding augmentation).
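The core of magnitude-based pruning is a few lines. The sketch below (a toy random weight matrix, not a trained network) zeroes out the fraction of weights with the smallest absolute values and records the binary mask that keeps them at zero during any subsequent fine-tuning:

```python
import numpy as np

rng = np.random.default_rng(7)

# Magnitude-based weight pruning on a toy weight matrix.
W = rng.normal(size=(100, 100))
prune_fraction = 0.92                       # retain 8%, as in Han et al. (2016)

threshold = np.quantile(np.abs(W), prune_fraction)
mask = np.abs(W) >= threshold               # True for the weights we keep
W_pruned = W * mask

print(f"weights retained: {mask.mean():.2%}")
# During fine-tuning, masked gradient updates keep pruned weights at zero:
#     W -= lr * (grad * mask)
```

Pruning hidden units, channels, or layers follows the same pattern, with a coarser-grained mask.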
Pruning is a form of architecture search. In their work on lottery tickets (see sec-
tion 20.2.7), Frankle & Carbin (2019) (i) trained a network, (ii) pruned the weights with
the smallest magnitudes, and (iii) retrained the remaining network from the same ini-
tial weights. By iterating this procedure, they reduced the size of the VGG-19 network
(originally 138 million parameters) by 98.5% on the CIFAR-10 database (60,000 exam-
ples) while maintaining good performance. For ResNet-50 (25.6 million parameters),
they reduced the parameters by 80% without reducing the performance on ImageNet
(1.28 million examples). These demonstrations are impressive but (disregarding data
augmentation) these networks are still over-parameterized after pruning.
20.5.2 Knowledge distillation
The parameters can also be reduced by training a smaller network (the student) to
replicate the performance of a larger one (the teacher). This is known as knowledge
distillation and dates back to at least Buciluǎ et al. (2006). Hinton et al. (2015) showed
that the pattern of information across the output classes is important and trained a
smaller network to approximate the pre-softmax logits of the larger one (figure 20.16).
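One common form of the distillation objective (following Hinton et al., 2015) adds a term that matches the teacher's temperature-softened class probabilities to the usual cross-entropy on the true label. The sketch below uses random stand-in logits rather than real network outputs, and the temperature value is an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(8)

def softmax(z, T=1.0):
    """Softmax at temperature T; higher T gives a softer distribution."""
    e = np.exp(z / T - np.max(z / T))
    return e / e.sum()

teacher_logits = rng.normal(size=10)   # stand-ins for real network outputs
student_logits = rng.normal(size=10)
true_label = 3
T = 4.0                                # distillation temperature

# Usual classification loss on the true label.
ce_loss = -np.log(softmax(student_logits)[true_label])

# Distillation term: KL divergence between softened teacher and student.
p_t, p_s = softmax(teacher_logits, T), softmax(student_logits, T)
distill_loss = np.sum(p_t * (np.log(p_t) - np.log(p_s)))

total = ce_loss + (T ** 2) * distill_loss  # T^2 rescales the gradient magnitude
print(f"cross-entropy: {ce_loss:.3f}, distillation: {distill_loss:.3f}")
```

The soft teacher probabilities carry information about the relative similarity of the classes that the hard labels alone do not.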
Zagoruyko & Komodakis (2017) further encouraged the spatial maps of the activa-
tions of the student network to be similar to the teacher network at various points. They
use this attention transfer method to approximate the performance of a 34-layer residual
network (63 million parameters) with an 18-layer residual network (11 million parameters) on the ImageNet classification task. However, this is still larger than the number of training examples (1 million images). Modern methods (e.g., Chen et al., 2021a) can improve on this result, but distillation has not yet provided convincing evidence that underparameterized models can perform well.

Figure 20.16 Knowledge distillation. a) A teacher network for image classification is trained as usual, using a multiclass cross-entropy classification loss. b) A smaller student network is trained with the same loss, plus also a distillation loss that encourages the pre-softmax activations to be the same as for the teacher.
20.5.3 Discussion
Current evidence suggests that overparameterization is needed for generalization, at least for the size and complexity of datasets that are currently used. There are no demonstrations of state-of-the-art performance on complex datasets where there are significantly fewer parameters than training examples. Attempts to reduce model size by pruning or distilling trained networks have not changed this picture.
Moreover, recent theory shows that there is a trade-off between the model's Lipschitz constant and overparameterization; Bubeck & Sellke (2021) proved that in D dimensions, smooth interpolation requires D times more parameters than mere interpolation. They argue that current models for large datasets (e.g., ImageNet) aren't overparameterized enough; increasing model capacity further may be key to improving performance.
20.6 Do networks have to be deep?
Chapter 3 discussed the universal approximation theorem. This states that shallow
neural networks can approximate any function to arbitrary accuracy given enough hidden
units. This raises the obvious question of whether networks need to be deep.
First, let’s consider the evidence that depth is required. Historically, there has been
a denite correlation between performance and depth. For example, performance on the
ImageNet benchmark initially improved as a function of network depth until training
became dicult. Subsequently, residual connections and batch normalization (chap-
ter 11) allowed training of deeper networks with commensurate gains in performance.
At the time of writing, almost all state-of-the-art applications, including image classica-
tion (e.g., the vision transformer), text generation (e.g., GPT3), and text-guided image
synthesis (e.g., DALL·E-2), are based on deep networks with tens or hundreds of layers.
Despite this trend, there have been efforts to use shallower networks. Zagoruyko & Komodakis (2016) constructed shallower but wider residual neural networks and achieved similar performance to ResNet. More recently, Goyal et al. (2021) constructed a network that used parallel convolutional channels and achieved performance similar to deeper networks with only 12 layers. Furthermore, Veit et al. (2016) showed that it is predominantly shorter paths of 5–17 layers that drive performance in residual networks.
Nonetheless, the balance of evidence suggests that depth is critical; even the shallowest networks with good image classification performance require >10 layers. However, there is no definitive explanation for why. Three possible explanations are that (i) deep networks can represent more complex functions than shallow ones, (ii) deep networks are easier to train, and (iii) deep networks impose better inductive biases.
20.6.1 Complexity of modeled function
Chapter 4 showed that deep networks produce functions with many more linear regions than shallow ones for the same parameter count. We also saw that “pathological” functions have been identified that require exponentially more hidden units to model with a shallow network than a deep one (e.g., Eldan & Shamir, 2016; Telgarsky, 2016). Indeed, Liang & Srikant (2016) found quite general families of functions that are more efficiently modeled by deep networks. However, Nye & Saxe (2018) found that some of these functions cannot easily be fit by deep networks in practice. Moreover, there is little evidence that the real-world functions that we are approximating have these pathological properties.
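The exponential gap between deep and shallow representations can be seen in a tiny numerical experiment. The sketch below is a hedged illustration in the spirit of Telgarsky's construction (not a function from this chapter): a "tent map" built from two ReLU units is composed with itself, so a depth-k composition uses only 2k hidden units but produces 2^k linear regions, whereas a shallow network would need roughly one hidden unit per region.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def tent(x):
    # Two ReLU units computing the "tent" map on [0, 1]:
    # f(x) = 2x for x < 0.5 and 2 - 2x otherwise.
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def deep_tent(x, depth):
    # Composing the tent map `depth` times uses 2 * depth ReLU units in total.
    for _ in range(depth):
        x = tent(x)
    return x

def count_linear_regions(f, n_grid=100001):
    # Count distinct constant-slope pieces of f on [0, 1] via finite differences.
    x = np.linspace(0.0, 1.0, n_grid)
    slopes = np.diff(f(x)) / np.diff(x)
    return int(np.sum(np.abs(np.diff(slopes)) > 1e-6)) + 1

for depth in (1, 2, 3, 4):
    print(depth, count_linear_regions(lambda x: deep_tent(x, depth)))  # 2, 4, 8, 16 regions
```

Each extra layer doubles the number of regions; matching this with one hidden layer requires exponentially many units.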
20.6.2 Tractability of training
An alternative explanation is that shallow networks with a practical number of hidden units could support state-of-the-art performance, but it is just difficult to find a good solution that both fits the training data well and interpolates sensibly.
One way to show this is to distill successful deep networks into shallower (but wider) student models and see if performance can be maintained. Urban et al. (2017) distilled an ensemble of 16 convolutional networks for image classification on the CIFAR-10 dataset into student models of varying depths. They found that shallow networks could not replicate the performance of the deeper teacher and that the student performance increased as a function of depth for a constant parameter budget.
Draft: please send errata to udlbookmail@gmail.com.
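Distillation of this kind is usually implemented by training the student to match the teacher's softened output distribution. The sketch below shows one common form of the soft-target objective (a temperature-scaled cross-entropy; this is an illustrative assumption, not necessarily the exact loss used by Urban et al.):

```python
import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # numerically stabilized softmax
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # Cross-entropy between the teacher's softened outputs (the "soft targets")
    # and the student's softened predictions, averaged over the batch.
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    return float(-np.sum(p_teacher * np.log(p_student + 1e-12), axis=-1).mean())

teacher = np.array([[2.0, 0.0, -1.0]])
matched = distillation_loss(teacher, teacher)      # student mimics the teacher
mismatched = distillation_loss(-teacher, teacher)  # student disagrees
```

The loss is smallest when the student reproduces the teacher's soft targets; temperatures T > 1 soften the targets so that the teacher's relative confidences across the wrong classes also carry signal.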
20.6.3 Inductive bias
Most current models rely on convolutional blocks or transformers. These networks share
parameters for local regions of the input data, and often they gradually integrate this
information across the whole input. These constraints mean that the functions that
these networks can represent are not general. One explanation for the supremacy of
deep networks, then, is that these constraints have a good inductive bias and that it is
dicult to force shallow networks to obey these constraints.
Multi-layer convolutional architectures seem to be inherently helpful, even without
training. Ulyanov et al. (2018) demonstrated that the structure of an untrained CNN can be used as a prior in low-level tasks such as denoising and super-resolution. Frankle et al. (2021) achieved good performance in image classification by initializing the kernels randomly, fixing their values, and just training the batch normalization offset and scaling factors. Zhang et al. (2017a) show that features from randomly initialized convolutional filters can support subsequent image classification using a kernel model.
Additional evidence that convolutional networks provide a useful inductive bias comes
from Urban et al. (2017), who attempted to distill convolutional networks into shal-
lower networks. They found that distilling into convolutional architectures systemat-
ically worked better than distilling into fully connected networks. This suggests that
the convolutional architecture has some inherent advantages. Since the sequential local
processing of convolutional networks cannot easily be replicated by shallower networks,
this argues that depth is indeed important.
20.7 Summary
This chapter has made the case that the success of deep learning is surprising. We
discussed the challenges of optimizing high-dimensional loss functions and argued that
overparameterization and the choice of activation function are the two most important
factors that make this tractable in deep networks. We saw that, during training, the
parameters move through a low-dimensional subspace to one of a family of connected
global minima and that local minima are not apparent.
Generalization of neural networks also improves with overparameterization, although other factors, such as the flatness of the minimum and the inductive bias of the architecture, are also important. It appears that both a large number of parameters and multiple network layers are required for good generalization, although we do not yet know why.
Many questions remain unanswered. We do not currently have any prescriptive theory that will allow us to predict the circumstances in which training and generalization will succeed or fail. We do not know the limits of learning in deep networks or whether much more efficient models are possible. We do not know if there are parameters that would generalize better within the same model. The study of deep learning is still driven by empirical demonstrations. These are undeniably impressive, but they are not yet matched by our understanding of deep learning mechanisms.
This work is subject to a Creative Commons CC-BY-NC-ND license. (C) MIT Press.
Problems
Problem 20.1 Consider the ImageNet image classification task in which the input images contain 224×224×3 RGB values. Consider coarsely quantizing these inputs into ten bins per RGB value and training with 10^7 training examples. How many possible inputs are there per training data point?
Problem 20.2 Consider figure 20.1. Why do you think that the algorithm fits the data faster
when the pixels are randomized relative to when the labels are randomized?
Problem 20.3 Figure 20.2 shows a non-stochastic fitting process with a fixed learning rate successfully fitting random data. Does this imply that the loss function has no local minima?
Does this imply that the function is convex? Justify your answer and give a counter-example if
you think either statement is false.
Chapter 21
Deep learning and ethics
This chapter was written by Travis LaCroix and Simon J.D. Prince.
AI is poised to change society for better or worse. These technologies have enormous
potential for social good (Taddeo & Floridi, 2018; Tomašev et al., 2020), including important roles in healthcare (Rajpurkar et al., 2022) and the fight against climate change (Rolnick et al., 2023). However, they also have the potential for misuse and unintended harm. This has led to the emergence of the field of AI ethics.
The modern era of deep learning started in 2012 with AlexNet, but sustained interest
in AI ethics did not follow immediately. Indeed, a workshop on fairness in machine
learning was rejected from NeurIPS 2013 for want of material. It wasn’t until 2016 that
AI Ethics had its “AlexNet” moment, with ProPublica’s exposé on bias in the COMPAS
recidivism-prediction model (Angwin et al., 2016) and Cathy O’Neil’s book Weapons
of Math Destruction (O’Neil, 2016). Interest has swelled ever since; submissions to the
Conference on Fairness, Accountability, and Transparency (FAccT) have increased nearly
ten-fold in the ve years since its inception in 2018.
In parallel, many organizations have proposed policy recommendations for responsible
AI. Jobin et al. (2019) found 84 documents containing AI ethics principles, with 88%
released since 2016. This proliferation of non-legislative policy agreements, which depend on voluntary, non-binding cooperation, calls into question their efficacy (McNamara et al., 2018; Hagendorff, 2020; LaCroix & Mohseni, 2022). In short, AI Ethics is in its infancy, and ethical considerations are often reactive rather than proactive.
This chapter considers potential harms arising from the design and use of AI systems.
These include algorithmic bias, lack of explainability, data privacy violations, militariza-
tion, fraud, and environmental concerns. The aim is not to provide advice on being more
ethical. Instead, the goal is to express ideas and start conversations in key areas that
have received attention in philosophy, political science, and the broader social sciences.
21.1 Value alignment
When we design AI systems, we wish to ensure that their “values” (objectives) are aligned
with those of humanity. This is sometimes called the value alignment problem (Russell, 2019; Christian, 2020; Gabriel, 2020). This is challenging for three reasons. First, it’s difficult to define our values completely and correctly. Second, it is hard to encode these values as objectives of an AI model. Third, it is hard to ensure that the model learns to carry out these objectives.
In a machine learning model, the loss function is a proxy for our true objectives, and a misalignment between the two is termed the outer alignment problem (Hubinger et al., 2019). To the extent that this proxy is inadequate, there will be “loopholes” that the system can exploit to minimize its loss function while failing to satisfy the intended objective. For example, consider training an RL agent to play chess. If the agent is rewarded for capturing pieces, this may result in many drawn games rather than the desired behavior (to win the game). In contrast, the inner alignment problem is to ensure that the behavior of an AI system does not diverge from the intended objectives even when the loss function is well specified. If the learning algorithm fails to find the global minimum or the training data are unrepresentative, training can converge to a solution that is misaligned with the true objective, resulting in undesirable behavior (Goldberg, 1987; Mitchell et al., 1992; Lehman & Stanley, 2008).
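The chess example can be caricatured in a few lines of code; the policies and numbers below are invented purely to illustrate how a proxy reward can rank behaviors differently from the true objective:

```python
# Invented toy policies: one wins quickly, one maximizes captures but draws.
policies = {
    "quick_checkmate":    {"captures": 2,  "wins_game": True},
    "capture_everything": {"captures": 14, "wins_game": False},
}

# The proxy reward (pieces captured) and the true objective (winning the game)
# select different policies: an outer alignment failure in miniature.
proxy_best = max(policies, key=lambda p: policies[p]["captures"])
true_best = max(policies, key=lambda p: policies[p]["wins_game"])
print(proxy_best, true_best)  # capture_everything quick_checkmate
```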
Gabriel (2020) divides the value alignment problem into technical and normative
components. The technical component concerns how we encode values into the models
so that they reliably do what they should. Some concrete problems, such as avoiding
reward hacking and safe exploration, may have purely technical solutions (Amodei et al.,
2016). In contrast, the normative component concerns what the correct values are in the first place. There may be no single answer to this question, given the range of things that different cultures and societies value. It’s important that the encoded values are representative of everyone and not just culturally dominant subsets of society.
Another way to think about value alignment is as a structural problem that arises
when a human principal delegates tasks to an artificial agent (LaCroix, 2022). This is similar to the principal-agent problem in economics (Laffont & Martimort, 2002), which allows that there are competing incentives inherent in any relationship where one party is expected to act in another’s best interests. In the AI context, such conflicts of interest can arise when either (i) the objectives are misspecified or (ii) there is an informational asymmetry between the principal and the agent (figure 21.1).
Many topics in AI ethics can be understood in terms of this structural view of value
alignment. The following sections discuss problems of bias and fairness and artificial moral agency (both pertaining to specifying objectives) and transparency and explainability (both related to informational asymmetry).
21.1.1 Bias and fairness
From a purely scientic perspective, bias refers to statistical deviation from some norm.
In AI, it can be pernicious when this deviation depends on illegitimate factors that impact
an output. For example, gender is irrelevant to job performance, so it is illegitimate to
use gender as a basis for hiring a candidate. Similarly, race is irrelevant to criminality,
so it is illegitimate to use race as a feature for recidivism prediction.
Bias in AI models can be introduced in various ways (Fazelpour & Danks, 2021):
Figure 21.1 Structural description of the value alignment problem. Problems arise from a) misaligned objectives (e.g., bias) or b) informational asymmetries between a (human) principal and an (artificial) agent (e.g., lack of explainability). Adapted from LaCroix (2023).
Problem specication: Choosing a model’s goals requires a value judgment
about what is important to us, which allows for the creation of biases (Fazelpour
& Danks, 2021). Further biases may emerge if we fail to operationalize these
choices successfully and the problem specication fails to capture our intended
goals (Mitchell et al., 2021).
Data: Algorithmic bias can result when the dataset is unrepresentative or incomplete (Danks & London, 2017). For example, the PULSE face super-resolution algorithm (Menon et al., 2020) was trained on a database of photos of predominantly white celebrities. When applied to a low-resolution portrait of Barack Obama, it generated a photo of a white man (Vincent, 2020).
If the society in which training data are generated is structurally biased against
marginalized communities, even complete and representative datasets will elicit
biases (Mayson, 2018). For example, Black individuals in the US have been policed
and jailed more frequently than white individuals. Hence, historical data used to
train recidivism prediction models are already biased against Black communities.
Modeling and validation: Choosing a mathematical definition to measure model fairness requires a value judgment. There exist distinct but equally intuitive definitions that are logically inconsistent (Kleinberg et al., 2017; Chouldechova, 2017; Berk et al., 2017). This suggests the need to move from a purely mathematical conceptualization of fairness toward a more substantive evaluation of whether algorithms promote justice in practice (Green, 2022).
Deployment: Deployed algorithms may interact with other algorithms, structures, or institutions in society to create complex feedback loops that entrench extant biases (O’Neil, 2016). For example, large language models like GPT3 (Brown et al., 2020) are trained on web data. However, when GPT3 outputs are published online, the training data for future models is degraded. This may exacerbate biases and generate novel societal harm (Falbo & LaCroix, 2022).
Figure 21.2 Bias mitigation. Methods have been proposed to compensate for bias at all stages of the training pipeline, from data collection to post-processing of already trained models: data collection (identifying missing examples or variates and collecting them), pre-processing (modifying labels, input data, or input/output pairs), training (adversarial training, regularizing for fairness, or constraining the model to be fair), and post-processing (changing thresholds or trading off accuracy for fairness). See Barocas et al. (2023) and Mehrabi et al. (2022).
Unfairness can be exacerbated by considerations of intersectionality; social categories
can combine to create overlapping and interdependent systems of oppression. For ex-
ample, the discrimination experienced by a queer woman of color is not merely the
sum of the discrimination she might experience as queer, as gendered, or as racialized
(Crenshaw, 1991). Within AI, Buolamwini & Gebru (2018) showed that face analysis
algorithms trained primarily on lighter-skinned faces underperform for darker-skinned
faces. However, they perform even worse on combinations of features such as skin color
and gender than might be expected by considering those features independently.
Of course, steps can be taken to ensure that data are diverse, representative, and
complete. But if the society in which the training data are generated is structurally biased
against marginalized communities, even completely accurate datasets will elicit biases.
In light of the potential for algorithmic bias and the lack of representation in training
datasets described above, it is also necessary to consider how failure rates for the outputs
of these systems are likely to exacerbate discrimination against already-marginalized
communities (Buolamwini & Gebru, 2018; Raji & Buolamwini, 2019; Raji et al., 2022).
The resulting models may codify and entrench systems of power and oppression, including
capitalism and classism; sexism, misogyny, and patriarchy; colonialism and imperialism;
racism and white supremacy; ableism; and cis- and heteronormativity. A perspective
on bias that maintains sensitivity to power dynamics requires accounting for historical
inequities and labor conditions encoded in data (Micelli et al., 2022).
To prevent this, we must actively ensure that our algorithms are fair. A naïve approach is fairness through unawareness, which simply removes the protected attributes (e.g., race, gender) from the input features. Unfortunately, this is ineffective; the remaining features can still carry information about the protected attributes. More practical approaches first define a mathematical criterion for fairness. For example, the separation measure in binary classification requires that the prediction ŷ is conditionally independent of the protected variable a (e.g., race) given the true label y. Then they intervene in various ways to minimize the deviation from this measure (figure 21.2; see also Notebook 21.1, Bias mitigation).
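The separation criterion can be checked directly. In this minimal sketch (with invented labels and predictions), separation holds when, within each true class, the positive prediction rate is the same for both groups; the per-class gaps measure the deviation:

```python
import numpy as np

def separation_gaps(y_true, y_pred, a):
    # For each true class y, compare P(y_pred = 1 | a = 0, y) with
    # P(y_pred = 1 | a = 1, y). Zero gaps mean separation (equalized odds) holds.
    gaps = {}
    for y in (0, 1):
        rates = [float(y_pred[(y_true == y) & (a == g)].mean()) for g in (0, 1)]
        gaps[y] = abs(rates[0] - rates[1])
    return gaps  # gaps[1] is the true-positive-rate gap; gaps[0] the false-positive-rate gap

# Invented example: group a = 1 receives fewer correct positives and more false positives.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
a      = np.array([0, 0, 1, 1, 0, 0, 1, 1])
y_pred = np.array([1, 1, 1, 0, 0, 0, 1, 0])
print(separation_gaps(y_true, y_pred, a))  # {0: 0.5, 1: 0.5}
```

Bias mitigation methods then adjust the data, training objective, or decision thresholds to shrink these gaps.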
A further complicating factor is that we cannot tell if an algorithm is unfair to a com-
munity or take steps to avoid this unless we can establish community membership. Most
research on algorithmic bias and fairness has focused on ostensibly observable features
that might be present in training data (e.g., gender). However, features of marginalized
communities may be unobservable, making bias mitigation even more difficult. Examples
include queerness (Tomasev et al., 2021), disability status, neurotype, class, and religion.
A similar problem occurs when observable features have been excised from the training
data to prevent models from exploiting them.
21.1.2 Articial moral agency
Many decision spaces do not include actions that carry moral weight. For example,
choosing the next chess move has no obvious moral consequence. However, elsewhere
actions can carry moral weight. Examples include decision-making in autonomous vehi-
cles (Awad et al., 2018; Evans et al., 2020), lethal autonomous weapons systems (Arkin,
2008a,b), and professional service robots for childcare, elderly care, and health care (An-
derson & Anderson, 2008; Sharkey & Sharkey, 2012). As these systems become more
autonomous, they may need to make moral decisions independent of human input.
This leads to the notion of artificial moral agency. An artificial moral agent is an autonomous AI system capable of making moral judgments. Moral agency can be categorized in terms of increasing complexity (Moor, 2006):
1. Ethical impact agents are agents whose actions have ethical impacts. Hence,
almost any technology deployed in society might count as an ethical impact agent.
2. Implicit ethical agents are ethical impact agents that include some in-built
safety features.
3. Explicit ethical agents can contextually follow general moral principles or rules
of ethical conduct.
4. Full ethical agents are agents with beliefs, desires, intentions, free will, and
consciousness of their actions.
The eld of machine ethics seeks approaches to creating articial moral agents. These
approaches can be categorized as top-down, bottom-up, or hybrid (Allen et al., 2005). Top-
down (theory-driven) methods directly implement and hierarchically arrange concrete
rules based on some moral theory to guide ethical behavior. Asimov’s “Three Laws of
Robotics” are a trivial example of this approach.
In bottom-up (learning-driven) approaches, a model learns moral regularities from
data without explicit programming (Wallach et al., 2008). For example, Noothigattu
et al. (2018) designed a voting-based system for ethical decision-making that uses data
collected from human preferences in moral dilemmas to learn social preferences; the sys-
tem then summarizes and aggregates the results to render an “ethical” decision. Hybrid
approaches combine top-down and bottom-up approaches.
Some researchers have questioned the very idea of artificial moral agency and argued that moral agency is unnecessary for ensuring safety (van Wynsberghe & Robbins, 2019). See Cervantes et al. (2019) for a recent survey of artificial moral agency and Tolmeijer et al. (2020) for a recent survey on technical approaches to artificial moral agency.
21.1.3 Transparency and opacity
A complex computational system is transparent if all of the details of its operation are
known. A system is explainable if humans can understand how it makes decisions. In the
absence of transparency or explainability, there is an asymmetry of information between
the user and the AI system, which makes it hard to ensure value alignment.
Creel (2020) characterizes transparency at several levels of granularity. Functional
transparency refers to knowledge of the algorithmic functioning of the system (i.e., the
logical rules that map inputs to outputs). The methods in this book are described at
this level of detail. Structural transparency entails knowing how a program executes the
algorithm. This can be obscured when commands written in high-level programming lan-
guages are executed by machine code. Finally, run transparency requires understanding
how a program was executed in a particular instance. For deep networks, this includes
knowledge about the hardware, input data, training data, and interactions thereof. None
of these can be ascertained by scrutinizing code.
For example, GPT3 is functionally transparent; its architecture is described in Brown
et al. (2020). However, it does not exhibit structural transparency as we do not have
access to the code, and it does not exhibit run transparency as we have no access to the
learned parameters, hardware, or training data. The subsequent version GPT4 is not
transparent at all. The details of how this commercial product works are unknown.
21.1.4 Explainability and interpretability
Even if a system is transparent, this does not imply that we can understand how a
decision is made or what information this decision is based on. Deep networks may
contain billions of parameters, so there is no way we can understand how they work
based on examination alone. However, in some jurisdictions, the public may have a right
to an explanation. Article 22 of the EU General Data Protection Regulation suggests all
data subjects should have the right to “obtain an explanation of the decision reached”
in cases where a decision is based solely on automated processes.[1]
These diculties have led to the sub-eld of explainable AI. One moderately success-
ful area is producing local explanations. Although we can’t explain the entire system,
Notebook 21.2
Explainability
we can sometimes describe how a particular input was classied. For example, Local
interpretable model-agnostic explanations or LIME (Ribeiro et al., 2016) samples the
model output at nearby inputs and uses these samples to construct a simpler model
(gure 21.3). This provides insight into the classication decision, even if the original
model is neither transparent nor explainable.
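The probing procedure can be sketched in a few lines. This is a hedged simplification (a weighted least-squares linear surrogate fit to the black box's output probabilities, with an invented black-box function), not the exact algorithm of Ribeiro et al. (2016):

```python
import numpy as np

def lime_surrogate(f, x0, n_samples=500, scale=0.5, kernel_width=0.5, seed=0):
    # Sample points near x0, weight them by proximity, and fit a weighted
    # linear model to the black-box outputs: the slopes give a local explanation.
    rng = np.random.default_rng(seed)
    X = x0 + scale * rng.standard_normal((n_samples, x0.size))
    y = f(X)                                      # black-box probabilities
    d = np.linalg.norm(X - x0, axis=1)
    w = np.exp(-d**2 / kernel_width**2)           # proximity weights
    A = np.hstack([X, np.ones((n_samples, 1))])   # linear model with bias term
    sw = np.sqrt(w)[:, None]
    coef, *_ = np.linalg.lstsq(A * sw, y * sw.ravel(), rcond=None)
    return coef[:-1], coef[-1]                    # (slopes, intercept)

# Invented black box standing in for a trained network's Pr(y=1|x).
def black_box(X):
    return 1.0 / (1.0 + np.exp(-(np.sin(X[:, 0]) + X[:, 1] ** 2 - 0.5)))

slopes, _ = lime_surrogate(black_box, np.array([0.0, 1.0]))
# Positive slopes here indicate that increasing either input raises Pr(y=1|x) locally.
```

The surrogate's coefficients are interpretable even though the black box is not; they are only valid near the probed point.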
It remains to be seen whether it is possible to build complex decision-making systems
that are fully understandable to their users or even their creators. There is also an
ongoing debate about what it means for a system to be explainable, understandable, or
interpretable (Erasmus et al., 2021); there is currently no concrete definition of these
concepts. See Molnar (2022) for more information.
[1] Whether Article 22 actually mandates such a right is debatable (see Wachter et al., 2017).
Figure 21.3 LIME. Output functions of deep networks are complex; in high dimensions, it’s hard to know why a decision was made or how to modify the inputs to change it without access to the model. a) Consider trying to understand why Pr(y = 1|x) is low at the white cross. LIME probes the network at nearby points to see if it identifies these as Pr(y = 1|x) < 0.5 (cyan points) or Pr(y = 1|x) ≥ 0.5 (gray points). It weights these points by proximity to the point of interest (weight indicated by circle size). b) The weighted points are used to train a simpler model (here, logistic regression: a linear function passed through a sigmoid). c) Near the white cross, this approximation is close to d) the original function. Even though we did not have access to the original model, we can deduce from the parameters of this approximate model that if we increase x1 or decrease x2, Pr(y = 1|x) will increase, and the output class will change. Adapted from Prince (2022).
21.2 Intentional misuse
The problems in the previous section arise from poorly specied objectives and infor-
mational asymmetries. However, even when a system functions correctly, it can entail
unethical behavior or be intentionally misused. This section highlights some specific ethical concerns arising from the misuse of AI systems.
21.2.1 Face recognition and analysis
Face recognition technologies have an especially high risk for misuse. Authoritarian
states can use them to identify and silence protesters, thus risking democratic ideals
of free speech and the right to protest. Smith & Miller (2022) argue that there is a
mismatch between the values of liberal democracy (e.g., security, privacy, autonomy,
and accountability) and the potential use cases for these technologies (e.g., border se-
curity, criminal investigation and policing, national security, and the commercialization
of personal data). Thus, some researchers, activists, and policymakers have questioned
whether this technology should exist (Barrett, 2020).
Moreover, these technologies often do not do what they purport to (Raji et al., 2022).
For example, the New York Metropolitan Transportation Authority moved forward with
and expanded its use of facial recognition despite a proof-of-concept trial reporting a
100% failure rate to detect faces within acceptable parameters (Berger, 2019). Similarly,
facial analysis tools often oversell their abilities (Raji & Fried, 2020), dubiously claiming
to be able to infer individuals’ sexual orientation (Leuner, 2019), emotions (Stark & Hoey,
2021), hireability (Fetscherin et al., 2020), or criminality (Wu & Zhang, 2016). Stark
& Hutson (2022) highlight that computer vision systems have created a resurgence in
the “scientically baseless, racist, and discredited pseudoscientic elds” of physiognomy
and phrenology.
21.2.2 Militarization and political interference
Governments have a vested interest in funding AI research in the name of national
security and state building. This risks an arms race between nation-states, which carries
with it “high rates of investment, a lack of transparency, mutual suspicion and fear, and
a perceived intent to deploy rst” (Sisson et al., 2020).
Lethal autonomous weapons systems receive significant attention because they are easy to imagine, and indeed many such systems are under development (Heikkilä, 2022).
However, AI also facilitates cyber-attacks and disinformation campaigns (i.e., inaccurate
or misleading information that is shared with the intent to deceive). AI systems allow the
creation of highly realistic fake content and facilitate the dissemination of information,
often to targeted audiences (Akers et al., 2018) and at scale (Bontridder & Poullet, 2021).
Kosinski et al. (2013) suggest that sensitive variables, including sexual orientation,
ethnicity, religious and political views, personality traits, intelligence, happiness, use of
addictive substances, parental separation, age, and gender, can be predicted by “likes”
on social media alone. From this information, personality traits like “openness” can be
used for manipulative purposes (e.g., to change voting behavior).
21.2.3 Fraud
Unfortunately, AI is a useful tool for automating fraudulent activities (e.g., sending mass
emails or text messages that trick people into revealing sensitive information or sending
money). Generative AI can be used to deceive people into thinking they are interacting
with a legitimate entity or generate fake documents that mislead or deceive people.
Additionally, AI could increase the sophistication of cyber-attacks, such as by generating
more convincing phishing emails or adapting to the defenses of targeted organizations.
This highlights the downside of calls for transparency in machine learning systems:
the more open and transparent these systems are, the more vulnerable they may be to
security risks or use by bad-faith actors. For example, generative language models, like ChatGPT, have been used to write software and emails that could be used for espionage, ransomware, and other malware (Goodin, 2023).
The tendency to anthropomorphize computer behaviors, and particularly the projection of meaning onto strings of symbols, is termed the ELIZA effect (Hofstadter, 1995).
This leads to a false sense of security when interacting with sophisticated chatbots, mak-
ing people more susceptible to text-based fraud such as romance scams or business email
compromise schemes (Abrahams, 2023). Véliz (2023) highlights how emoji use in some
chatbots is inherently manipulative, exploiting instinctual responses to emotive images.
21.2.4 Data privacy
Modern deep learning methods rely on huge crowd-sourced datasets, which may contain
sensitive or private information. Even when sensitive information is removed, auxiliary
knowledge and redundant encodings can be used to de-anonymize datasets (Narayanan
& Shmatikov, 2008). Indeed, this famously happened to the Governor of Massachusetts,
William Weld, in 1997. After an insurance group released health records that had been
stripped of obvious personal information like patient name and address, an aspiring
graduate student was able to “de-anonymize” which records belonged to Governor Weld
by cross-referencing with public voter rolls.
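The Weld re-identification is an instance of a linkage attack, which can be sketched with entirely synthetic records (all names, zip codes, and dates below are invented): joining the "anonymized" release to a public dataset on quasi-identifiers such as zip code, birth date, and sex recovers identities.

```python
# Synthetic "anonymized" health records: names removed, quasi-identifiers kept.
health = [
    {"zip": "02138", "birth": "1945-07-31", "sex": "M", "diagnosis": "X"},
    {"zip": "02139", "birth": "1972-03-12", "sex": "F", "diagnosis": "Y"},
]
# Synthetic public voter roll: names present alongside the same quasi-identifiers.
voters = [
    {"name": "W. Weld",  "zip": "02138", "birth": "1945-07-31", "sex": "M"},
    {"name": "A. Smith", "zip": "02139", "birth": "1980-01-01", "sex": "F"},
]

def link(health, voters):
    # Join on (zip, birth, sex): any unique match de-anonymizes a record.
    return [
        (v["name"], h["diagnosis"])
        for h in health
        for v in voters
        if all(h[k] == v[k] for k in ("zip", "birth", "sex"))
    ]

print(link(health, voters))  # [('W. Weld', 'X')]
```

The attack needs no access to the model or the anonymization procedure, only an auxiliary dataset sharing a few columns.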
Hence, privacy-rst design is important for ensuring the security of individuals’ in-
formation, especially when applying deep learning techniques to high-risk areas such
as healthcare and nance. Dierential privacy and semantic security (homomorphic en-
cryption or secure multi-party computation) methods can be used to ensure data security
during model training (see Mireshghallah et al., 2020; Boulemtafes et al., 2020).
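Differential privacy is mentioned only in passing here; as a minimal illustration (the classic Laplace mechanism, not a method from the cited surveys), a query answer can be released with noise whose scale is the query's sensitivity divided by the privacy budget ε:

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng):
    # Adding Laplace(0, sensitivity / epsilon) noise gives epsilon-differential
    # privacy for a query whose output changes by at most `sensitivity` when
    # one individual's record is added or removed.
    return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

rng = np.random.default_rng(0)
ages = np.array([34, 45, 29, 61, 52])       # invented records
true_count = int(np.sum(ages > 40))         # counting queries have sensitivity 1
noisy_count = laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.5, rng=rng)
```

Smaller ε means stronger privacy and noisier answers, and repeated queries consume the budget additively.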
21.3 Other social, ethical, and professional issues
The previous section identied areas where AI can be deliberately misused. This section
describes other potential side eects of the widespread adoption of AI.
21.3.1 Intellectual property
Intellectual property (IP) can be characterized as non-physical property that is the prod-
uct of original thought (Moore & Himma, 2022). In practice, many AI models are trained
on copyrighted material. Consequently, these models’ deployment can pose legal and
ethical risks and run afoul of intellectual property rights (Henderson et al., 2023).
Sometimes, these issues are explicit. When language models are prompted with
excerpts of copyrighted material, their outputs may include copyrighted text verbatim,
and similar issues apply in the context of image generation in diffusion models (Henderson
et al., 2023; Carlini et al., 2022, 2023). Even if the training falls under “fair use,” this
may violate the moral rights of content creators in some cases (Weidinger et al., 2022).
More subtly, generative models (chapters 12, 14–18) raise novel questions regarding AI
and intellectual property. Can the output of a machine learning model (e.g., art, music,
code, text) be copyrighted or patented? Is it morally acceptable or legal to fine-tune a model on a particular artist’s work to reproduce that artist’s style? IP law is one area
that highlights how existing legislation was not created with machine learning models
in mind. Although governments and courts may set precedents in the near future, these
questions are still open at the time of writing.
21.3.2 Automation bias and moral deskilling
As society relies more on AI systems, there is an increased risk of automation bias (i.e., the expectation that model outputs are correct because they are “objective”). This leads to the view that quantitative methods are better than qualitative ones. However, as we shall see in section 21.5, purportedly objective endeavors are rarely value-free.
The sociological concept of deskilling refers to the redundancy and devaluation of skills in light of automation (Braverman, 1974). For example, off-loading cognitive skills like memory onto technology may cause a decrease in our capacity to remember things. Analogously, automating morally loaded decision-making with AI may lead to a decrease in our moral abilities (Vallor, 2015). For example, in the context of war, the automation of weapons systems may lead to the dehumanization of victims of war (Asaro, 2012; Heyns, 2017). Similarly, care robots in elderly-, child-, or healthcare settings may reduce our ability to care for one another (Vallor, 2011).
21.3.3 Environmental impact
Training deep networks requires significant computational power and hence consumes a large amount of energy. Strubell et al. (2019, 2020) estimate that training a transformer model with 213 million parameters emitted around 284 tonnes of CO₂.² Luccioni et al. (2022) have provided similar estimates for the emissions produced from training the BLOOM language model. Unfortunately, the increasing prevalence of closed, proprietary models means that we know nothing about their environmental impacts (Luccioni, 2023).
21.3.4 Employment and society
The history of technological innovation is a history of job displacement. In 2018, the McKinsey Global Institute estimated that AI may increase economic output by approximately US $13 trillion by 2030, primarily from the substitution of labor by automation (Bughin et al., 2018). Another study from the McKinsey Global Institute suggests that up to 30% of the global workforce (400–800 million people) could have their jobs displaced due to AI between 2016 and 2030 (Manyika et al., 2017; Manyika & Sneader, 2018).
²As a baseline, it is estimated that the average human is responsible for around 5 tonnes of CO₂ per year, with individuals from major oil-producing countries responsible for three times this amount. See https://ourworldindata.org/co2-emissions.
However, forecasting is inherently difficult, and although automation by AI may lead to short-term job losses, the concept of technological unemployment has been described as a “temporary phase of maladjustment” (Keynes, 2010). This is because gains in wealth can offset gains in productivity by creating increased demand for products and services. In addition, new technologies can create new types of jobs.
Problem 21.10
Even if automation doesn’t lead to a net loss of overall employment in the long term,
new social programs may be required in the short term. Therefore, regardless of whether
one is optimistic (Brynjolfsson & McAfee, 2016; Danaher, 2019), neutral (Metcalf et al.,
2016; Calo, 2018; Frey, 2019), or pessimistic (Frey & Osborne, 2017) about the possibility
of unemployment in light of AI, it is clear that society will be changed significantly.
21.3.5 Concentration of power
As deep networks increase in size, there is a corresponding increase in the amount of data and computing power required to train these models. In this regard, smaller companies and start-ups may not be able to compete with large, established tech companies. This may give rise to a feedback loop whereby power and wealth become increasingly concentrated in the hands of a small number of corporations. A recent study finds an increasing discrepancy between publications at major AI venues by large tech firms and “elite” universities versus mid- or lower-tier universities (Ahmed & Wahed, 2016). In many views, such a concentration of wealth and power is incompatible with just distributions in society (Rawls, 1971).
This has led to calls to democratize AI by making it possible for everyone to create such systems (Li, 2018; Knight, 2018; Kratsios, 2019; Riedl, 2020). Such a process requires making deep learning technologies more widely available and easier to use via open source and open science so that more people can benefit from them. This reduces barriers to entry and increases access to AI while cutting down costs, ensuring model accuracy, and increasing participation and inclusion (Ahmed et al., 2020).
Problem 21.11
21.4 Case study
We now describe a case study that speaks to many of the issues that we have discussed in this chapter. In 2018, the popular media reported on a controversial facial analysis model—dubbed “gaydar AI” (Wang & Kosinski, 2018)—with sensationalist headlines like AI Can Tell If You’re Gay: Artificial Intelligence Predicts Sexuality From One Photo with Startling Accuracy (Ahmed, 2017); A Frightening AI Can Determine Whether a Person Is Gay With 91 Percent Accuracy (Matsakis, 2017); and Artificial Intelligence System Can Tell If You’re Gay (Fernandez, 2017).
There are a number of problems with this work. First, the training dataset was highly biased and unrepresentative, consisting mostly of images of Caucasian individuals. Second, the modeling and validation are also questionable, given the fluidity of gender and sexuality. Third, the most obvious use case for such a model is the targeted discrimination and persecution of LGBTQ+ individuals in countries where queerness is criminalized. Fourth,
with regard to transparency, explainability, and value alignment more generally, the
“gaydar” model appears to pick up on spurious correlations due to patterns in grooming,
presentation, and lifestyle rather than facial structure, as the authors claimed (Agüera y
Arcas et al., 2018). Fifth, with regard to data privacy, questions arise regarding the ethics
of scraping “public” photos and sexual orientation labels from a dating website. Finally,
with regard to scientic communication, the researchers communicated their results in a
way that was sure to generate headlines: even the title of the paper is an overstatement of
the model’s abilities: Deep Neural Networks Can Detect Sexual Orientation from Faces.
(They cannot.)
It should also be apparent that a facial-analysis model for determining sexual orientation does nothing whatsoever to benefit the LGBTQ+ community. If it is to benefit
society, the most important question is whether a particular study, experiment, model,
application, or technology serves the interests of the community to which it pertains.
21.5 The value-free ideal of science
This chapter has enumerated a number of ways that the objectives of AI systems can
unintentionally, or through misuse, diverge from the values of humanity. We now argue
that scientists are not neutral actors; their values inevitably impinge on their work.
Perhaps this is surprising. There is a broad belief that science is—or ought to be—objective. This is codified by the value-free ideal of science. Many would argue that machine learning is objective because algorithms are just mathematics. However, analogous to algorithmic bias (section 21.1.1), there are four stages at which the values of AI practitioners can affect their work (Reiss & Sprenger, 2017):
1. The choice of research problem.
2. Gathering evidence related to a research problem.
3. Accepting a scientific hypothesis as an answer to a problem.
4. Applying the results of scientific research.
It is perhaps uncontroversial that values play a significant role in the first and last of these stages. The initial selection of research problems and the choice of subsequent applications are influenced by the interests of scientists, institutions, and funding agencies. However, the value-free ideal of science prescribes minimizing the influence of moral, personal, social, political, and cultural values on the intervening scientific process. This idea presupposes the value-neutrality thesis, which suggests that scientists can (at least in principle) attend to stages (2) and (3) without making these value judgments.
However, whether intentional or not, values are embedded in machine learning research. Most of these values would be classed as epistemic (e.g., performance, generalization, building on past work, efficiency, novelty). But deciding the set of values is itself a value-laden decision; few papers explicitly discuss societal need, and fewer still discuss potential negative impacts (Birhane et al., 2022b). Philosophers of science have
questioned whether the value-free ideal of science is attainable or desirable. For exam-
ple, Longino (1990, 1996) argues that these epistemic values are not purely epistemic.
Kitcher (2011a,b) argues that scientists don’t typically care about truth itself; instead,
they pursue truths relevant to their goals and interests.
Machine learning depends on inductive inference and is hence prone to inductive risk.
Models are only constrained at the training data points, and the curse of dimensionality
means this is a tiny proportion of the input space; outputs can always be wrong, regard-
less of how much data we use to train the model. It follows that choosing to accept or
reject a model prediction requires a value judgment: that the risks if we are wrong in
acceptance are lower than the risks if we are wrong in rejection.
Hence, the use of inductive inference implies that machine learning models are deeply
value-laden (Johnson, 2022). In fact, if they were not, they would have no application: it is precisely because they are value-laden that they are useful. Thus, accepting that algorithms are used for ranking, sorting, filtering, recommending, categorizing, labeling, predicting, etc., in the real world implies that these processes will have real-world effects. As machine learning systems become increasingly commercialized and applied, they become more entrenched in the things we care about.
These insights have implications for researchers who believe that algorithms are some-
how more objective than human decision-makers (and, therefore, ought to replace human
decision-makers in areas where we think objectivity matters).
21.6 Responsible AI research as a collective action problem
It is easy to defer responsibility. Students and professionals who read this chapter might
think their work is so far removed from the real world or a small part of a larger machine that their actions could not make a difference. However, this is a mistake. Researchers
often have a choice about the projects to which they devote their time, the companies
or institutions for which they work, the knowledge they seek, the social and intellectual
circles in which they interact, and the way they communicate.
Doing the right thing, whatever that may comprise, often takes the form of a social dilemma; the best outcomes depend upon cooperation, although it isn’t necessarily in any individual’s interest to cooperate: responsible AI research is a collective action problem.
Problem 21.12
21.6.1 Scientic communication
One positive step is to communicate responsibly. Misinformation spreads faster and
persists more readily than the truth in many types of social networks (LaCroix et al.,
2021; Ceylan et al., 2023). As such, it is important not to overstate machine learning
systems’ abilities (see case study above) and to avoid misleading anthropomorphism. It
is also important to be aware of the potential for the misapplication of machine learning
techniques. For example, pseudoscientific practices like phrenology and physiognomy
have found a surprising resurgence in AI (Stark & Hutson, 2022).
21.6.2 Diversity and heterogeneity
A second positive step is to encourage diversity. When social groups are homogeneous
(composed mainly of similar members) or homophilous (comprising members that tend
to associate with similar others), the dominant group tends to have its conventions
recapitulated and stabilized (O’Connor & Bruner, 2019). One way to mitigate systems
of oppression is to ensure that diverse views are considered. This might be achieved
through equity, diversity, inclusion, and accessibility initiatives (at an institutional level),
participatory and community-based approaches to research (at the research level), and
increased awareness of social, political, and moral issues (at an individual level).
The theory of standpoint epistemology (Harding, 1986) suggests that knowledge is
socially situated (i.e., depends on one’s social position in society). Homogeneity in tech
circles can give rise to biased tech (Noble, 2018; Eubanks, 2018; Benjamin, 2019; Broussard, 2023). Lack of diversity implies that the perspectives of the individuals who create these technologies will seep into the datasets, algorithms, and code as the default perspective. Broussard (2023) argues that because much technology is developed by able-bodied, white, cisgender, American men, that technology is optimized for able-bodied, white, cisgender, American men, whose perspective is taken as the status quo. Ensuring technologies benefit historically marginalized communities requires researchers to understand the needs, wants, and perspectives of those communities (Birhane et al., 2022a).
Design justice and participatory- and community-based approaches to AI research contend that the communities affected by technologies should be actively involved in their design (Costanza-Chock, 2020).
21.7 Ways forward
It is undeniable that AI will radically change society for better or worse. However,
optimistic visions of a future Utopian society driven by AI should be met with caution and a healthy dose of critical reflection. Many of the touted benefits of AI are beneficial only in certain contexts and only to a subset of society. For example, Green (2019) highlights that one project developed using AI to enhance police accountability and alternatives to incarceration and another developed to increase security through predictive policing are both advertised as “AI for Social Good.” Assigning this label is a value judgment that lacks any grounding principles; one community’s good is another’s harm.
When considering the potential for emerging technologies to benefit society, it is necessary to reflect on whether those benefits will be equally or equitably distributed. It is often assumed that the most technologically advanced solution is the best one—so-called technochauvinism (Broussard, 2018). However, many social issues arise from underlying social problems and do not warrant technological solutions.
Some common themes emerged throughout this chapter, and we would like to impress
four key points upon the reader:
1. Research in machine learning cannot avoid ethics. Historically, researchers
could focus on fundamental aspects of their work in a controlled laboratory setting. However, this luxury is dwindling due to the vast economic incentives to commercialize AI and the degree to which academic work is funded by industry (see Abdalla & Abdalla, 2021); even theoretical studies may have social impacts, so researchers must engage with the social and ethical dimensions of their work.
2. Even purely technical decisions can be value-laden. There is still a widely-
held view that AI is fundamentally just mathematics and, therefore, it is “objec-
tive,” and ethics are irrelevant. This assumption is not true when we consider the
creation of AI systems or their deployment.
3. We should question the structures within which AI work takes place.
Much research on AI ethics focuses on specic situations rather than questioning
the larger social structures within which AI will be deployed. For example, there
is considerable interest in ensuring algorithmic fairness, but it may not always be
possible to instantiate conceptions of fairness, justice, or equity within extant social
and political structures. Therefore, technology is inherently political.
4. Social and ethical problems don’t necessarily require technical solutions.
Many potential ethical problems surrounding AI technologies are primarily social
and structural, so technical innovation alone cannot solve these problems; if scientists are to effect positive change with new technology, they must take a political and moral position.
Problem 21.13
Where does this leave the average scientist? Perhaps with the following imperative: it is necessary to reflect upon the moral and social dimensions of one’s work. This might require actively engaging those communities that are likely to be most affected by new technologies, thus cultivating relationships between researchers and communities and empowering those communities. Likewise, it might involve engagement with the literature
beyond one’s own discipline. For philosophical questions, the Stanford Encyclopedia of
Philosophy is an invaluable resource. Interdisciplinary conferences are also useful in this
regard. Leading work is published at both the Conference on Fairness, Accountability,
and Transparency (FAccT) and the Conference on AI, Ethics, and Society (AIES).
21.8 Summary
This chapter considered the ethical implications of deep learning and AI. The value
alignment problem is the task of ensuring that the objectives of AI systems are aligned
with human objectives. Bias, explainability, artificial moral agency, and other topics can
be viewed through this lens. AI can be intentionally misused, and this chapter detailed
some ways this can happen. Progress in AI has further implications in areas as diverse
as IP law and climate change.
Ethical AI is a collective action problem, and the chapter concludes with an appeal
to scientists to consider the moral and ethical implications of their work. Not every ethical issue is within the control of every individual computer scientist. However, this does not imply that researchers have no responsibility whatsoever to consider—and mitigate where they can—the potential for misuse of the systems they create.
Problems
Problem 21.1 It was suggested that the most common specification of the value alignment problem for AI is “the problem of ensuring that the values of AI systems are aligned with the values of humanity.” Discuss the ways in which this statement of the problem is underspecified.
Discussion Resource: LaCroix (2023).
Problem 21.2 Goodhart’s law states that “when a measure becomes a target, it ceases to be a good measure.” Consider how this law might be reformulated to apply to value alignment for artificial intelligence, given that the loss function is a mere proxy for our true objectives.
Problem 21.3 Suppose a university uses data from past students to build models for predicting
“student success,” where those models can support informed changes in policies and practices.
Consider how biases might aect each of the four stages of the development and deployment of
this model.
Discussion Resource: Fazelpour & Danks (2021).
Problem 21.4 We might think of functional transparency, structural transparency, and run
transparency as orthogonal. Provide an example of how an increase in one form of transparency
may not lead to a concomitant increase in another form of transparency.
Discussion Resource: Creel (2020).
Problem 21.5 If a computer scientist writes a research paper on AI or pushes code to a public
repository, do you consider them responsible for future misuse of their work?
Problem 21.6 To what extent do you think the militarization of AI is inevitable?
Problem 21.7 In light of the possible misuse of AI highlighted in section 21.2, make arguments
both for and against the open-source culture of research in deep learning.
Problem 21.8 Some have suggested that personal data is a source of power for those who own it.
Discuss the ways personal data is valuable to companies that utilize deep learning and consider
the claim that losses to privacy are experienced collectively rather than individually.
Discussion Resource: Véliz (2020).
Problem 21.9 What are the implications of generative AI for the creative industries? How do
you think IP laws should be modified to cope with this new development?
Problem 21.10 A good forecast must (i) be specific enough to know when it is wrong, (ii)
account for possible cognitive biases, and (iii) allow for rationally updating beliefs. Consider
any claim in the recent media about future AI and discuss whether it satises these criteria.
Discussion Resource: Tetlock & Gardner (2016).
Problem 21.11 Some critics have argued that calls to democratize AI have focused too heavily on
the participatory aspects of democracy, which can increase risks of errors in collective perception,
reasoning, and agency, leading to morally-bad outcomes. Reflect on each of the following: What
aspects of AI should be democratized? Why should AI be democratized? How should AI be
democratized?
Discussion Resource: Himmelreich (2022).
Problem 21.12 In March 2023, the Future of Life Institute published a letter, “Pause Giant AI
Experiments,” in which they called on all AI labs to immediately pause for at least six months
the training of AI systems more powerful than GPT-4. Discuss the motivations of the authors
in writing this letter, the public reaction, and the implications of such a pause. Relate this
episode to the view that AI ethics can be considered a collective action problem (section 21.6).
Discussion Resource: Gebru et al. (2023).
Problem 21.13 Discuss the merits of the four points in section 21.7. Do you agree with them?
Appendix A
Notation
This appendix details the notation used in this book. This mostly adheres to standard conventions in computer science, but deep learning is applicable to many different areas,
so it is explained in full. In addition, there are several notational conventions that
are unique to this book, including notation for functions and the systematic distinction
between parameters and variables.
Scalars, vectors, matrices, and tensors
Scalars are denoted by either small or capital letters a, A, α. Column vectors (i.e., 1D arrays of numbers) are denoted by small bold letters a, ϕ, and row vectors as the transpose of column vectors a^T, ϕ^T. Matrices and tensors (i.e., 2D and ND arrays of numbers, respectively) are both represented by bold capital letters B, Φ.
Variables and parameters
Variables (usually the inputs and outputs of functions or intermediate calculations) are always denoted by Roman letters a, b, C. Parameters (which are internal to functions or probability distributions) are always denoted by Greek letters α, β, Γ. Generic, unspecified parameters are denoted by ϕ. This distinction is retained throughout the book except for the policy in reinforcement learning, which is denoted by π according to the usual convention.
Sets
Sets are denoted by curly brackets, so {0, 1, 2} denotes the numbers 0, 1, and 2. The notation {0, 1, 2, . . .} denotes the set of non-negative integers. Sometimes, we want to specify a set of variables, and {x_i}_{i=1}^I denotes the I variables x_1, . . . , x_I. When it’s not necessary to specify how many items are in the set, this is shortened to {x_i}. The notation {x_i, y_i}_{i=1}^I denotes the set of I pairs x_i, y_i. The convention for naming sets is to use calligraphic letters. Notably, B_t is used to denote the set of indices in a batch at iteration t during training. The number of elements in a set S is denoted by |S|.
The set R denotes the set of real numbers. The set R^+ denotes the set of non-negative real numbers. The notation R^D denotes the set of D-dimensional vectors containing real numbers. The notation R^{D_1×D_2} denotes the set of matrices of dimension D_1 × D_2. The notation R^{D_1×D_2×D_3} denotes the set of tensors of size D_1 × D_2 × D_3, and so on.
The notation [a, b] denotes the real numbers from a to b, including a and b themselves. When the square brackets are replaced by round brackets, this means that the adjacent value is not included in the set. For example, the set (−π, π] denotes the real numbers from −π to π, but excluding −π.
Membership of sets is denoted by the symbol ∈, so x ∈ R^+ means that the variable x is a non-negative real number, and the notation Σ ∈ R^{D×D} denotes that Σ is a matrix of size D × D. Sometimes, we want to work through each element of a set systematically, and the notation ∀ {1, . . . , K} means “for all” the integers from 1 to K.
Functions
Functions are expressed as a name, followed by square brackets that contain the arguments of the function. For example, log[x] returns the logarithm of the variable x. When the function returns a vector, it is written in bold and starts with a small letter. For example, the function y = mlp[x, ϕ] returns a vector y and has vector arguments x and ϕ. When a function returns a matrix or tensor, it is written in bold and starts with a capital letter. For example, the function Y = Sa[X, ϕ] returns a matrix Y and has arguments X and ϕ. When we want to leave the arguments of a function deliberately ambiguous, we use the bullet symbol (e.g., mlp[•, ϕ]).
Minimizing and maximizing
Some special functions are used repeatedly throughout the text:
• The function min_x[f[x]] returns the minimum value of the function f[x] over all possible values of the variable x. This notation is often used without specifying the details of how this minimum might be found.
• The function argmin_x[f[x]] returns the value of x that minimizes f[x], so if y = argmin_x[f[x]], then min_x[f[x]] = f[y].
• The functions max_x[f[x]] and argmax_x[f[x]] perform the equivalent operations for maximizing functions.
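These operations map directly onto array code. The sketch below evaluates min and argmin on a grid for the hypothetical objective f[x] = (x − 3)², chosen here purely for illustration:

```python
import numpy as np

# Evaluate min_x[f[x]] and argmin_x[f[x]] on a grid for the hypothetical
# objective f[x] = (x - 3)^2 (an illustrative choice, not from the book).
x = np.linspace(-5.0, 5.0, 1001)
f = (x - 3.0) ** 2

y = x[np.argmin(f)]  # argmin_x[f[x]]: the minimizing input
min_val = f.min()    # min_x[f[x]]: the minimum value

# The defining relation: if y = argmin_x[f[x]], then min_x[f[x]] = f[y].
assert np.isclose(min_val, (y - 3.0) ** 2)
```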
Probability distributions
Probability distributions should be written as Pr(x = a), denoting that the random variable x takes the value of a. However, this notation is cumbersome. Hence, we usually simplify this and just write Pr(x), where x denotes either the random variable or the value it takes according to the sense of the equation. The conditional probability of y given x is written as Pr(y|x). The joint probability of y and x is written as Pr(y, x). These two forms can be combined, so Pr(y|x, ϕ) denotes the probability of the variable y, given that we know x and ϕ. Similarly, Pr(y, x|ϕ) denotes the probability of variables y and x given that we know ϕ. When we need two probability distributions over the same variable, we write Pr(x) for the first distribution and q(x) for the second. More information about probability distributions can be found in appendix C.
Asymptotic notation
Asymptotic notation is used to compare the amount of work done by different algorithms as the size D of the input increases. This can be done in various ways, but this book only uses big-O notation, which represents an upper bound on the growth of computation in an algorithm. A function f[n] is O[g[n]] if there exists a constant c > 0 and an integer n_0 such that f[n] < c · g[n] for all n > n_0.
This notation provides a bound on the worst-case running time of an algorithm. For example, when we say that inversion of a D × D matrix is O[D^3], we mean that the computation will increase no faster than some constant times D^3 once D is large enough. This gives us an idea of how feasible it is to invert matrices of different sizes. If D = 10^3, then it may take of the order of 10^9 operations to invert it.
Miscellaneous
A small dot in a mathematical equation is intended to improve ease of reading and has no real meaning (or just implies multiplication). For example, α · f[x] is the same as αf[x] but is easier to read. To avoid ambiguity, dot products are written as a^T b (see appendix B.3.4). A left arrow symbol ← denotes assignment, so x ← x + 2 means that we are adding two to the current value of x.
Appendix B
Mathematics
This appendix reviews mathematical concepts that are used in the main text.
B.1 Functions
A function defines a mapping from a set X (e.g., the set of real numbers) to another set Y. An injection is a one-to-one function where every element in the first set maps to a unique position in the second set (but there may be elements of the second set that are not mapped to). A surjection is a function where every element in the second set receives a mapping from the first (but multiple elements of the first set may map to the same element of the second). A bijection or bijective mapping is a function that is both injective and surjective. It provides a one-to-one correspondence between all members of the two sets. A diffeomorphism is a special case of a bijection where both the forward and reverse mapping are differentiable.
B.1.1 Lipschitz constant
A function f[z] is Lipschitz continuous if for all z_1, z_2:

||f[z_1] − f[z_2]|| ≤ β ||z_1 − z_2||,   (B.1)

where β is known as the Lipschitz constant and determines the maximum gradient of the function (i.e., how fast the function can change) with respect to the distance metric. If the Lipschitz constant is less than one, the function is a contraction mapping, and we can use Banach’s theorem to find the inverse for any point (see figure 16.9).
Composing two functions with Lipschitz constants β_1 and β_2 creates a new Lipschitz continuous function with a constant that is less than or equal to β_1 β_2. Adding two functions with Lipschitz constants β_1 and β_2 creates a new Lipschitz continuous function with a constant that is less than or equal to β_1 + β_2. The Lipschitz constant of a linear transformation f[z] = Az + b with respect to a Euclidean distance measure is the largest singular value of A (i.e., its spectral norm).
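The linear case can be checked empirically. The sketch below takes β to be the spectral norm (largest singular value) of a random A and verifies the defining inequality (B.1) on random point pairs; the dimension and seed are arbitrary choices:

```python
import numpy as np

# Empirical check of the Lipschitz inequality (B.1) for a linear map
# f[z] = Az + b, taking beta to be the spectral norm (largest singular value)
# of A. The dimension and random seed below are arbitrary choices.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
b = rng.standard_normal(3)
beta = np.linalg.norm(A, 2)  # spectral norm = largest singular value

def f(z):
    return A @ z + b

for _ in range(1000):
    z1, z2 = rng.standard_normal(3), rng.standard_normal(3)
    lhs = np.linalg.norm(f(z1) - f(z2))
    rhs = beta * np.linalg.norm(z1 - z2)
    assert lhs <= rhs + 1e-9  # small tolerance for floating-point rounding
```

Note that f(z1) − f(z2) = A(z1 − z2), so the offset b plays no role; this is why a linear map's Lipschitz constant depends only on A.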
B.1.2 Convexity
A function is convex if we can draw a straight line between any two points on the function, and this line always lies above the function. Similarly, a function is concave if a straight line between any two points always lies below the function. By definition, convex (concave) functions have at most one minimum (maximum).
A region of R^D is convex if we can draw a straight line between any two points on the boundary of the region without intersecting the boundary in another place. Gradient descent is guaranteed to find the global minimum of any function that is both convex and defined on a convex region.
B.1.3 Special functions
The following functions are used in the main text:
• The exponential function y = exp[x] (figure B.1a) maps a real variable x ∈ R to a non-negative number y ∈ R^+ as y = e^x.
• The logarithm x = log[y] (figure B.1b) is the inverse of the exponential function and maps a non-negative number y ∈ R^+ to a real variable x ∈ R. Note that all logarithms in this book are natural (i.e., in base e).
• The gamma function Γ[x] (figure B.1c) is defined as:

Γ[x] = ∫_0^∞ t^{x−1} e^{−t} dt.   (B.2)

This extends the factorial function to continuous values so that Γ[x] = (x − 1)! for x ∈ {1, 2, . . .}.
• The Dirac delta function δ[z] has a total area of one, all of which is at position z = 0. A dataset with N elements can be thought of as a probability distribution consisting of a sum of N delta functions centered at each data point x_i and scaled by 1/N. The delta function is usually drawn as an arrow (e.g., figure 5.12). The delta function has the key property that:

∫ f[x] δ[x − x_0] dx = f[x_0].   (B.3)
B.1.4 Stirling’s formula

Stirling’s formula (figure B.2) approximates the factorial function (and hence the Gamma function) using the formula:

$$x! \approx \sqrt{2\pi x}\left(\frac{x}{e}\right)^x. \tag{B.4}$$
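The quality of the approximation can be checked numerically; this sketch (using only the standard library, with our own helper name) compares Stirling's formula to the exact factorial:

```python
import math

def stirling(x):
    # Stirling's approximation: x! is roughly sqrt(2*pi*x) * (x/e)**x
    return math.sqrt(2 * math.pi * x) * (x / math.e) ** x

# The ratio to the true factorial approaches one as x grows.
for x in [1, 5, 10, 20]:
    print(x, stirling(x) / math.factorial(x))
```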
This work is subject to a Creative Commons CC-BY-NC-ND license. (C) MIT Press.
Figure B.1 Exponential, logarithm, and gamma functions. a) The exponential function maps a real number to a positive number. It is a convex function. b) The logarithm is the inverse of the exponential and maps a positive number to a real number. It is a concave function. c) The Gamma function is a continuous extension of the factorial function so that $\Gamma[x] = (x-1)!$ for $x \in \{1, 2, \ldots\}$.
Figure B.2 Stirling’s formula. The factorial function $x!$ can be approximated by Stirling’s formula $\text{Stir}[x]$, which is defined for every real value.
B.2 Binomial coefficients
Binomial coefficients are written as $\binom{n}{k}$ and pronounced as “$n$ choose $k$.” They are positive integers that represent the number of ways of choosing an unordered subset of $k$ items from a set of $n$ items without replacement. Binomial coefficients can be computed using the simple formula:

$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!}. \tag{B.5}$$
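Equation B.5 can be verified directly against Python's built-in `math.comb` (an editorial check, not from the main text):

```python
import math

def binom(n, k):
    # Direct implementation of equation B.5.
    return math.factorial(n) // (math.factorial(k) * math.factorial(n - k))

# Agrees with the standard library's math.comb (Python 3.8+).
assert binom(5, 2) == math.comb(5, 2) == 10
print(binom(52, 5))  # number of 5-card poker hands: 2598960
```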
B.2.1 Autocorrelation
The autocorrelation $r[\tau]$ of a continuous function $f[z]$ is defined as:

$$r[\tau] = \int_{-\infty}^{\infty} f[t + \tau]\, f[t]\, dt, \tag{B.6}$$
where $\tau$ is the time lag. Sometimes, this is normalized by $r[0]$ so that the autocorrelation at time lag zero is one. The autocorrelation function is a measure of the correlation of the function with itself as a function of an offset (i.e., the time lag). If a function changes slowly and predictably, then the autocorrelation function will decrease slowly as the time lag increases from zero. If the function changes fast and unpredictably, then it will decrease quickly to zero.
B.3 Vectors, matrices, and tensors
In machine learning, a vector $x \in \mathbb{R}^D$ is a one-dimensional array of $D$ numbers, which we will assume are organized in a column. Similarly, a matrix $Y \in \mathbb{R}^{D_1 \times D_2}$ is a two-dimensional array of numbers with $D_1$ rows and $D_2$ columns. A tensor $z \in \mathbb{R}^{D_1 \times D_2 \times \ldots \times D_N}$ is an $N$-dimensional array of numbers. Confusingly, all three of these quantities are stored in objects known as “tensors” in deep learning APIs such as PyTorch and TensorFlow.
B.3.1 Transpose
The transpose $A^T \in \mathbb{R}^{D_2 \times D_1}$ of a matrix $A \in \mathbb{R}^{D_1 \times D_2}$ is formed by reflecting it around the principal diagonal so that the $k$th column becomes the $k$th row and vice-versa. If we take the transpose of a matrix product $AB$, then we take the transpose of the original matrices but reverse the order so that:

$$(AB)^T = B^T A^T. \tag{B.7}$$

The transpose of a column vector $a$ is a row vector $a^T$ and vice-versa.
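The transpose rule in equation B.7 can be checked numerically; the `matmul` and `transpose` helpers below are our own minimal implementations (an illustrative sketch, not library functions):

```python
def matmul(A, B):
    # Matrix product following equation B.10.
    return [[sum(A[i][d] * B[d][j] for d in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    # Reflect the matrix around its principal diagonal.
    return [list(row) for row in zip(*A)]

A = [[1, 2, 3],
     [4, 5, 6]]    # 2x3
B = [[7, 8],
     [9, 10],
     [11, 12]]     # 3x2

# Equation B.7: (AB)^T = B^T A^T
assert transpose(matmul(A, B)) == matmul(transpose(B), transpose(A))
```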
B.3.2 Vector and matrix norms
For a vector $z$, the $\ell_p$ norm is defined as:

$$\|z\|_p = \left( \sum_{d=1}^{D} |z_d|^p \right)^{1/p}. \tag{B.8}$$

When $p = 2$, this returns the length of the vector, and this is known as the Euclidean norm. It is this case that is most commonly used in deep learning; often the exponent $p$ is omitted, and the Euclidean norm is just written as $\|z\|$. When $p = \infty$, the operator returns the maximum absolute value in the vector.

Norms can be computed in a similar way for matrices. For example, the $\ell_2$ norm of a matrix $Z$ (known as the Frobenius norm) is calculated as:
$$\|Z\|_F = \left( \sum_{i=1}^{I} \sum_{j=1}^{J} |z_{ij}|^2 \right)^{1/2}. \tag{B.9}$$
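A minimal sketch of equation B.8 (the `lp_norm` helper name is our own, not a library function):

```python
import math

def lp_norm(z, p):
    # Equation B.8; p = float("inf") returns the maximum absolute value.
    if math.isinf(p):
        return max(abs(v) for v in z)
    return sum(abs(v) ** p for v in z) ** (1.0 / p)

z = [3.0, -4.0]
print(lp_norm(z, 1))             # 7.0
print(lp_norm(z, 2))             # Euclidean length: 5.0
print(lp_norm(z, float("inf")))  # 4.0
```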
B.3.3 Product of matrices
The product $C = AB$ of two matrices $A \in \mathbb{R}^{D_1 \times D_2}$ and $B \in \mathbb{R}^{D_2 \times D_3}$ is a third matrix $C \in \mathbb{R}^{D_1 \times D_3}$, where:

$$C_{ij} = \sum_{d=1}^{D_2} A_{id} B_{dj}. \tag{B.10}$$
B.3.4 Dot product of vectors
The dot product $a^T b$ of two vectors $a \in \mathbb{R}^D$ and $b \in \mathbb{R}^D$ is a scalar and is defined as:

$$a^T b = b^T a = \sum_{d=1}^{D} a_d b_d. \tag{B.11}$$

It can be shown that the dot product equals the Euclidean norm of the first vector times the Euclidean norm of the second vector times the cosine of the angle $\theta$ between them:

$$a^T b = \|a\|\,\|b\| \cos[\theta]. \tag{B.12}$$
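Equation B.12 lets us recover the angle between two vectors from the dot product; a small sketch with our own helper names:

```python
import math

def dot(a, b):
    # Equation B.11.
    return sum(ai * bi for ai, bi in zip(a, b))

def norm(a):
    # Euclidean norm (equation B.8 with p = 2).
    return math.sqrt(dot(a, a))

a = [1.0, 0.0]
b = [1.0, 1.0]

# Rearranging equation B.12 gives cos[theta] = a^T b / (||a|| ||b||).
cos_theta = dot(a, b) / (norm(a) * norm(b))
print(math.degrees(math.acos(cos_theta)))  # approximately 45 degrees
```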
B.3.5 Inverse
A square matrix $A$ may or may not have an inverse $A^{-1}$ such that $A^{-1}A = AA^{-1} = I$. If a matrix does not have an inverse, it is called singular. If we take the inverse of a matrix product $AB$, then we can equivalently take the inverse of each matrix individually and reverse the order of multiplication:

$$(AB)^{-1} = B^{-1} A^{-1}. \tag{B.13}$$

In general, it takes $O[D^3]$ operations to invert a $D \times D$ matrix. However, inversion is more efficient for special types of matrices, including diagonal, orthogonal, and triangular matrices (see section B.4).
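The reversal rule in equation B.13 can be checked with a hand-rolled 2×2 inverse (an illustrative sketch; in practice, one would use a linear algebra library):

```python
def matmul2(A, B):
    # 2x2 matrix product.
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def inv2(A):
    # Closed-form inverse of a 2x2 matrix; fails when the determinant is zero.
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

A = [[2.0, 1.0], [1.0, 3.0]]
B = [[1.0, 4.0], [0.0, 2.0]]

lhs = inv2(matmul2(A, B))        # (AB)^-1
rhs = matmul2(inv2(B), inv2(A))  # B^-1 A^-1  (equation B.13)
assert all(abs(lhs[i][j] - rhs[i][j]) < 1e-12
           for i in range(2) for j in range(2))
```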
B.3.6 Subspaces
Consider a matrix $A \in \mathbb{R}^{D_1 \times D_2}$. If the number of columns $D_2$ of the matrix is fewer than the number of rows $D_1$ (i.e., the matrix is “portrait”), the product $Ax$ cannot reach all
Figure B.3 Eigenvalues. When the points $\{x_i\}$ on the unit circle are transformed to points $\{x_i'\}$ by a linear transformation $x_i' = Ax_i$, they are mapped to an ellipse. For example, the light blue point on the unit circle is mapped to the light blue point on the ellipse. The length of the major (longest) axis of the ellipse (long gray arrow) is the magnitude of the first eigenvalue of the matrix, and the length of the minor (shortest) axis of the ellipse (short gray arrow) is the magnitude of the second eigenvalue.
possible positions in the $D_1$-dimensional output space. This product consists of the $D_2$ columns of $A$ weighted by the $D_2$ elements of $x$ and can only reach the linear subspace that is spanned by these columns. This is known as the column space of the matrix. Conversely, for a landscape matrix $A$, the part of the input space that maps to zero (i.e., those $x$ where $Ax = 0$) is termed the nullspace of the matrix.
B.3.7 Eigenspectrum
If we multiply the set of 2D points on a unit circle by a $2 \times 2$ matrix $A$, they map to an ellipse (figure B.3). The radii of the major and minor axes of this ellipse (i.e., the longest and shortest directions) correspond to the magnitudes of the eigenvalues $\lambda_1$ and $\lambda_2$ of the matrix (strictly, to its singular values, which coincide with the absolute eigenvalues when $A$ is symmetric). The eigenvalues also have a sign, which relates to whether the matrix reflects the inputs about the origin. The same idea applies in higher dimensions: a $D$-dimensional sphere is mapped by a $D \times D$ matrix $A$ to a $D$-dimensional ellipsoid, and the radii of the $D$ principal axes of this ellipsoid determine the magnitudes of the eigenvalues.

The spectral norm of a square matrix is the largest absolute eigenvalue (for symmetric matrices; in general, it is the largest singular value). It captures the largest possible change in magnitude when the matrix is applied to a vector of unit length. As such, it tells us about the Lipschitz constant of the transformation. The set of eigenvalues is sometimes called the eigenspectrum and tells us about the magnitude of the scaling applied by the matrix across all directions. This information can be summarized using the determinant and trace of the matrix.
B.3.8 Determinant and trace
Every square matrix $A$ has a scalar associated with it called the determinant and denoted by $|A|$ or $\det[A]$, which is the product of the eigenvalues. It is hence related to the average scaling applied by the matrix for different inputs. Matrices with small absolute determinants tend to decrease the norm of vectors upon multiplication. Matrices with large absolute determinants tend to increase the norm. If a matrix is singular, the determinant will be zero, and there will be at least one direction in space that is mapped
to the origin when the matrix is applied. Determinants of matrix expressions obey the following rules:

$$\begin{aligned}
|A^T| &= |A| \\
|AB| &= |A|\,|B| \\
|A^{-1}| &= 1/|A|.
\end{aligned} \tag{B.14}$$

The trace of a square matrix is the sum of the diagonal values (the matrix itself need not be diagonal) or, equivalently, the sum of the eigenvalues. Traces obey these rules:

$$\begin{aligned}
\text{trace}[A^T] &= \text{trace}[A] \\
\text{trace}[AB] &= \text{trace}[BA] \\
\text{trace}[A + B] &= \text{trace}[A] + \text{trace}[B] \\
\text{trace}[ABC] &= \text{trace}[BCA] = \text{trace}[CAB],
\end{aligned} \tag{B.15}$$

where in the last relation, the trace is invariant to cyclic permutations only, so in general, $\text{trace}[ABC] \neq \text{trace}[BAC]$.
B.4 Special types of matrix
Calculating the inverse of a square matrix $A \in \mathbb{R}^{D \times D}$ has a complexity of $O[D^3]$, as does the computation of the determinant. However, for some matrices with special properties, these computations can be more efficient.
B.4.1 Diagonal matrices
A diagonal matrix has zeros everywhere except on the principal diagonal. If these diagonal entries are all non-zero, the inverse is also a diagonal matrix, with each diagonal entry $d_{ii}$ replaced by $1/d_{ii}$. The determinant is the product of the values on the diagonal. A special case of this is the identity matrix, which has ones on the diagonal. Consequently, its inverse is also the identity matrix, and its determinant is one.
B.4.2 Triangular matrices
A lower triangular matrix contains only non-zero values on the principal diagonal and the positions below it. An upper triangular matrix contains only non-zero values on the principal diagonal and the positions above it. In both cases, the matrix can be inverted in $O[D^2]$ operations (see problem 16.4), and the determinant is just the product of the values on the diagonal.
B.4.3 Orthogonal matrices
Orthogonal matrices represent rotations and reflections around the origin, so in figure B.3, the circle would be mapped to another circle of unit radius but rotated and possibly reflected. Accordingly, the eigenvalues must all have magnitude one, and the determinant must be either one or minus one. The inverse of an orthogonal matrix is its transpose, so $A^{-1} = A^T$.
B.4.4 Permutation matrices
A permutation matrix has exactly one non-zero entry in each row and column, and all of these entries take the value one. It is a special case of an orthogonal matrix, so its inverse is its own transpose, and its determinant is either $+1$ or $-1$, depending on the parity of the permutation. As the name suggests, it has the effect of permuting the entries of a vector. For example:

$$\begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{bmatrix} \begin{bmatrix} a \\ b \\ c \end{bmatrix} = \begin{bmatrix} b \\ c \\ a \end{bmatrix}. \tag{B.16}$$
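A small sketch (not from the main text) illustrating both the permutation in equation B.16 and the fact that the determinant's sign depends on the parity of the permutation:

```python
# The permutation matrix from equation B.16 maps (a, b, c) to (b, c, a).
P = [[0, 1, 0],
     [0, 0, 1],
     [1, 0, 0]]

v = ["a", "b", "c"]
permuted = [v[row.index(1)] for row in P]  # entry selected by each row
assert permuted == ["b", "c", "a"]

# Swapping just two entries is an odd permutation with determinant -1,
# so permutation-matrix determinants are +1 or -1, not always +1.
P_swap = [[0, 1], [1, 0]]
det = P_swap[0][0] * P_swap[1][1] - P_swap[0][1] * P_swap[1][0]
assert det == -1
```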
B.4.5 Linear algebra
Linear algebra is the mathematics of linear functions, which have the form:

$$f[z_1, z_2, \ldots, z_D] = \phi_1 z_1 + \phi_2 z_2 + \ldots + \phi_D z_D, \tag{B.17}$$

where $\phi_1, \ldots, \phi_D$ are parameters that define the function. We often add a constant term $\phi_0$ to the right-hand side. This is technically an affine function but is commonly referred to as linear in machine learning. We adopt this convention throughout.
B.4.6 Linear equations in matrix form
Consider a collection of linear functions:

$$\begin{aligned}
y_1 &= \phi_{10} + \phi_{11} z_1 + \phi_{12} z_2 + \phi_{13} z_3 \\
y_2 &= \phi_{20} + \phi_{21} z_1 + \phi_{22} z_2 + \phi_{23} z_3 \\
y_3 &= \phi_{30} + \phi_{31} z_1 + \phi_{32} z_2 + \phi_{33} z_3.
\end{aligned} \tag{B.18}$$

These can be written in matrix form as:

$$\begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix} = \begin{bmatrix} \phi_{10} \\ \phi_{20} \\ \phi_{30} \end{bmatrix} + \begin{bmatrix} \phi_{11} & \phi_{12} & \phi_{13} \\ \phi_{21} & \phi_{22} & \phi_{23} \\ \phi_{31} & \phi_{32} & \phi_{33} \end{bmatrix} \begin{bmatrix} z_1 \\ z_2 \\ z_3 \end{bmatrix}, \tag{B.19}$$

or as $y = \phi_0 + \Phi z$ for short, where $y_i = \phi_{i0} + \sum_{j=1}^{3} \phi_{ij} z_j$.
B.5 Matrix calculus
Most readers of this book will be accustomed to the idea that if we have a function $y = f[x]$, we can compute the derivative $\partial y/\partial x$, and this represents how $y$ changes when we make a small change in $x$. This idea extends to functions $y = f[x]$ mapping a vector $x$ to a scalar $y$, functions $y = f[x]$ mapping a vector $x$ to a vector $y$, functions $y = f[X]$ mapping a matrix $X$ to a vector $y$, and so on. The rules of matrix calculus help us compute derivatives of these quantities. The derivatives take the following forms:

For a function $y = f[x]$ where $y \in \mathbb{R}$ and $x \in \mathbb{R}^D$, the derivative $\partial y/\partial x$ is also a $D$-dimensional vector, where the $i$th element is computed as $\partial y/\partial x_i$.

For a function $y = f[x]$ where $y \in \mathbb{R}^{D_y}$ and $x \in \mathbb{R}^{D_x}$, the derivative $\partial y/\partial x$ is a $D_x \times D_y$ matrix where element $(i, j)$ contains the derivative $\partial y_j/\partial x_i$. This is known as a Jacobian and is sometimes written as $\nabla_x y$ in other documents.

For a function $y = f[X]$ where $y \in \mathbb{R}^{D_y}$ and $X \in \mathbb{R}^{D_1 \times D_2}$, the derivative $\partial y/\partial X$ is a 3D tensor containing the derivatives $\partial y_i/\partial x_{jk}$.

Often these matrix and vector derivatives have superficially similar forms to the scalar case. For example, we have:

$$y = a^T x \quad \Longrightarrow \quad \frac{\partial y}{\partial x} = a, \tag{B.20}$$

and

$$y = Ax \quad \Longrightarrow \quad \frac{\partial y}{\partial x} = A^T. \tag{B.21}$$
Appendix C
Probability
Probability is critical to deep learning. In supervised learning, deep networks implicitly rely on a probabilistic formulation of the loss function. In unsupervised learning, generative models aim to produce samples that are drawn from the same probability distribution as the training data. Reinforcement learning occurs within Markov decision processes, and these are defined in terms of probability distributions. This appendix provides a primer for probability as used in machine learning.
C.1 Random variables and probability distributions
A random variable $x$ denotes a quantity that is uncertain. It may be discrete (take only certain values, for example, integers) or continuous (take any value on a continuum, for example, real numbers). If we observe several instances of a random variable $x$, it will take different values, and the relative propensity to take different values is described by a probability distribution $Pr(x)$.

For a discrete variable, this distribution associates a probability $Pr(x = k) \in [0, 1]$ with each potential outcome $k$, and the sum of these probabilities is one. For a continuous variable, there is a non-negative probability density $Pr(x = a) \geq 0$ associated with each value $a$ in the domain of $x$, and the integral of this probability density function (PDF) over this domain must be one. Note that the density at a given point $a$ can be greater than one. From here on, we assume that the random variables are continuous. The ideas are exactly the same for discrete distributions but with sums replacing integrals.
C.1.1 Joint probability
Consider the case where we have two random variables $x$ and $y$. The joint distribution $Pr(x, y)$ tells us about the propensity that $x$ and $y$ take particular combinations of values (figure C.1a). Now there is a non-negative probability density $Pr(x = a, y = b)$ associated with each pair of values $x = a$ and $y = b$, and this must satisfy:
Figure C.1 Joint and marginal distributions. a) The joint distribution $Pr(x, y)$ captures the propensity of variables $x$ and $y$ to take different combinations of values. Here, the probability density is represented by the color map, so brighter positions are more probable. For example, the combination $x = 6, y = 6$ is much less likely to be observed than the combination $x = 5, y = 0$. b) The marginal distribution $Pr(x)$ of variable $x$ can be recovered by integrating over $y$. c) The marginal distribution $Pr(y)$ of variable $y$ can be recovered by integrating over $x$.

$$\iint Pr(x, y)\, dx\, dy = 1. \tag{C.1}$$

This idea extends to more than two variables, so the joint density of $x$, $y$, and $z$ is written as $Pr(x, y, z)$. Sometimes, we store multiple random variables in a vector $x$, and we write their joint density as $Pr(x)$. Extending this, we can write the joint density of all of the variables in two vectors $x$ and $y$ as $Pr(x, y)$.
C.1.2 Marginalization
If we know the joint distribution $Pr(x, y)$ over two variables, we can recover the marginal distributions $Pr(x)$ and $Pr(y)$ by integrating over the other variable (figure C.1b–c):

$$\int Pr(x, y)\, dx = Pr(y) \qquad\qquad \int Pr(x, y)\, dy = Pr(x). \tag{C.2}$$

This process is called marginalization and has the interpretation that we are computing the distribution of one variable regardless of the value the other one took. The idea of marginalization extends to higher dimensions, so if we have a joint distribution $Pr(x, y, z)$, we can recover the joint distribution $Pr(x, z)$ by integrating over $y$.
C.1.3 Conditional probability and likelihood
The conditional probability $Pr(x|y)$ is the probability of variable $x$ taking a certain value, assuming we know the value of $y$. The vertical line is read as the English word “given,”
Figure C.2 Conditional distributions. a) Joint distribution $Pr(x, y)$ of variables $x$ and $y$. b) The conditional probability $Pr(x|y = 3.0)$ of variable $x$, given that $y$ takes the value 3.0, is found by taking the horizontal “slice” $Pr(x, y = 3.0)$ of the joint probability (top cyan line in panel a) and dividing this by the total area $Pr(y = 3.0)$ in that slice so that it forms a valid probability distribution that integrates to one. c) The conditional probability $Pr(x|y = -1.0)$ is found similarly using the slice at $y = -1.0$.
so $Pr(x|y)$ is the probability of $x$ given $y$. The conditional probability $Pr(x|y)$ can be found by taking a slice through the joint distribution $Pr(x, y)$ for a fixed $y$. This slice is then divided by the probability of that value $y$ occurring (the total area under the slice) so that the conditional distribution sums to one (figure C.2):

$$Pr(x|y) = \frac{Pr(x, y)}{Pr(y)}. \tag{C.3}$$

Similarly,

$$Pr(y|x) = \frac{Pr(x, y)}{Pr(x)}. \tag{C.4}$$
When we consider the conditional probability $Pr(x|y)$ as a function of $x$, it must sum to one. When we consider the same quantity $Pr(x|y)$ as a function of $y$, it is termed the likelihood of $x$ given $y$ and does not have to sum to one.
C.1.4 Bayes’ rule
From equations C.3 and C.4, we get two expressions for the joint probability $Pr(x, y)$:

$$Pr(x, y) = Pr(x|y)Pr(y) = Pr(y|x)Pr(x), \tag{C.5}$$

which we can rearrange to get:
Figure C.3 Independence. a) When two variables $x$ and $y$ are independent, the joint distribution factors into the product of marginal distributions, so $Pr(x, y) = Pr(x)Pr(y)$. Independence implies that knowing the value of one variable tells us nothing about the other. b–c) Accordingly, all of the conditional distributions $Pr(x|y = \bullet)$ are the same and are equal to the marginal distribution $Pr(x)$.
$$Pr(x|y) = \frac{Pr(y|x)Pr(x)}{Pr(y)}. \tag{C.6}$$

This expression relates the conditional probability $Pr(x|y)$ of $x$ given $y$ to the conditional probability $Pr(y|x)$ of $y$ given $x$ and is known as Bayes’ rule.

Each term in Bayes’ rule has a name. The term $Pr(y|x)$ is the likelihood of $y$ given $x$, and the term $Pr(x)$ is the prior probability of $x$. The denominator $Pr(y)$ is known as the evidence, and the left-hand side $Pr(x|y)$ is termed the posterior probability of $x$ given $y$. The equation maps from the prior $Pr(x)$ (what we know about $x$ before observing $y$) to the posterior $Pr(x|y)$ (what we know about $x$ after observing $y$).
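As a concrete sketch of Bayes' rule (a standard diagnostic-test example with made-up numbers, not from the main text):

```python
# Hypothetical numbers for a diagnostic test.
prior_disease = 0.01     # Pr(x = disease)
like_pos_disease = 0.95  # Pr(y = positive | x = disease)
like_pos_healthy = 0.05  # Pr(y = positive | x = healthy)

# Evidence Pr(y = positive) by marginalizing over x.
evidence = (like_pos_disease * prior_disease
            + like_pos_healthy * (1.0 - prior_disease))

# Posterior Pr(x = disease | y = positive) via Bayes' rule (equation C.6).
posterior = like_pos_disease * prior_disease / evidence
print(posterior)  # approximately 0.161: unlikely despite the positive test
```

The low prior dominates: even with a fairly accurate test, most positive results come from the much larger healthy population.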
C.1.5 Independence
If the value of the random variable $y$ tells us nothing about $x$ and vice-versa, we say that $x$ and $y$ are independent, and we can write $Pr(x|y) = Pr(x)$ and $Pr(y|x) = Pr(y)$. It follows that all of the conditional distributions $Pr(y|x = \bullet)$ are identical, as are the conditional distributions $Pr(x|y = \bullet)$.

Starting from the first expression for the joint probability in equation C.5, we see that the joint distribution becomes the product of the marginal distributions:

$$Pr(x, y) = Pr(x|y)Pr(y) = Pr(x)Pr(y) \tag{C.7}$$

when the variables are independent (figure C.3).
C.2 Expectation
Consider a function $f[x]$ and a probability distribution $Pr(x)$ defined over $x$. The expected value of a function $f[\bullet]$ of a random variable $x$ with respect to the probability distribution $Pr(x)$ is defined as:

$$\mathbb{E}_x\bigl[f[x]\bigr] = \int f[x]\, Pr(x)\, dx. \tag{C.8}$$

As the name suggests, this is the expected or average value of $f[x]$ after taking into account the probability of seeing different values of $x$. This idea generalizes to functions $f[\bullet, \bullet]$ of more than one random variable:

$$\mathbb{E}_{x,y}\bigl[f[x, y]\bigr] = \iint f[x, y]\, Pr(x, y)\, dx\, dy. \tag{C.9}$$

An expectation is always taken with respect to a distribution over one or more variables. However, we don’t usually make this explicit when the choice of distribution is obvious and write $\mathbb{E}[f[x]]$ instead of $\mathbb{E}_x[f[x]]$.

If we drew a large number $I$ of samples $\{x_i\}_{i=1}^{I}$ from $Pr(x)$, calculated $f[x_i]$ for each sample, and took the average of these values, the result would approximate the expectation $\mathbb{E}[f[x]]$ of the function:

$$\mathbb{E}_x\bigl[f[x]\bigr] \approx \frac{1}{I}\sum_{i=1}^{I} f[x_i]. \tag{C.10}$$
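The sample-average approximation in equation C.10 is easy to demonstrate (an illustrative sketch; the uniform distribution, function, and seed are our choices):

```python
import random

random.seed(0)

# Approximate E[x^2] for x ~ Uniform[0, 1] using equation C.10.
# The true value is the integral of x^2 over [0, 1], which is 1/3.
I = 100_000
samples = [random.random() for _ in range(I)]
estimate = sum(x * x for x in samples) / I
print(estimate)  # close to 1/3
```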
C.2.1 Rules for manipulating expectations
There are four rules for manipulating expectations:

$$\begin{aligned}
\mathbb{E}[k] &= k \\
\mathbb{E}\bigl[k \cdot f[x]\bigr] &= k \cdot \mathbb{E}\bigl[f[x]\bigr] \\
\mathbb{E}\bigl[f[x] + g[x]\bigr] &= \mathbb{E}\bigl[f[x]\bigr] + \mathbb{E}\bigl[g[x]\bigr] \\
\mathbb{E}_{x,y}\bigl[f[x] \cdot g[y]\bigr] &= \mathbb{E}_x\bigl[f[x]\bigr] \cdot \mathbb{E}_y\bigl[g[y]\bigr] \quad \text{if } x, y \text{ independent},
\end{aligned} \tag{C.11}$$

where $k$ is an arbitrary constant. These are proven below for the continuous case.

Rule 1: The expectation $\mathbb{E}[k]$ of a constant value $k$ is just $k$:

$$\mathbb{E}[k] = \int k \cdot Pr(x)\, dx = k \cdot \int Pr(x)\, dx = k.$$
Rule 2: The expectation $\mathbb{E}[k \cdot f[x]]$ of a constant $k$ times a function of the variable $x$ is $k$ times the expectation $\mathbb{E}[f[x]]$ of the function:

$$\mathbb{E}\bigl[k \cdot f[x]\bigr] = \int k \cdot f[x]\, Pr(x)\, dx = k \cdot \int f[x]\, Pr(x)\, dx = k \cdot \mathbb{E}\bigl[f[x]\bigr].$$

Rule 3: The expectation of a sum $\mathbb{E}[f[x] + g[x]]$ of terms is the sum $\mathbb{E}[f[x]] + \mathbb{E}[g[x]]$ of the expectations:

$$\begin{aligned}
\mathbb{E}\bigl[f[x] + g[x]\bigr] &= \int \bigl(f[x] + g[x]\bigr) \cdot Pr(x)\, dx \\
&= \int \bigl(f[x] \cdot Pr(x) + g[x] \cdot Pr(x)\bigr)\, dx \\
&= \int f[x] \cdot Pr(x)\, dx + \int g[x] \cdot Pr(x)\, dx \\
&= \mathbb{E}\bigl[f[x]\bigr] + \mathbb{E}\bigl[g[x]\bigr].
\end{aligned}$$

Rule 4: The expectation of a product $\mathbb{E}[f[x] \cdot g[y]]$ of terms is the product $\mathbb{E}[f[x]] \cdot \mathbb{E}[g[y]]$ if $x$ and $y$ are independent:

$$\begin{aligned}
\mathbb{E}\bigl[f[x] \cdot g[y]\bigr] &= \iint f[x] \cdot g[y]\, Pr(x, y)\, dx\, dy \\
&= \iint f[x] \cdot g[y]\, Pr(x)\, Pr(y)\, dx\, dy \\
&= \int f[x] \cdot Pr(x)\, dx \int g[y] \cdot Pr(y)\, dy \\
&= \mathbb{E}\bigl[f[x]\bigr] \cdot \mathbb{E}\bigl[g[y]\bigr],
\end{aligned}$$
where we used the definition of independence (equation C.7) between the first two lines.

The four rules generalize to the multivariate case:

$$\begin{aligned}
\mathbb{E}[A] &= A \\
\mathbb{E}\bigl[A \cdot f[x]\bigr] &= A\,\mathbb{E}\bigl[f[x]\bigr] \\
\mathbb{E}\bigl[f[x] + g[x]\bigr] &= \mathbb{E}\bigl[f[x]\bigr] + \mathbb{E}\bigl[g[x]\bigr] \\
\mathbb{E}_{x,y}\bigl[f[x]^T g[y]\bigr] &= \mathbb{E}_x\bigl[f[x]\bigr]^T\, \mathbb{E}_y\bigl[g[y]\bigr] \quad \text{if } x, y \text{ independent},
\end{aligned} \tag{C.12}$$

where now $A$ is a constant matrix, $f[x]$ is a function of the vector $x$ that returns a vector, and $g[y]$ is a function of the vector $y$ that also returns a vector.
C.2.2 Mean, variance, and covariance
For some choices of function $f[\bullet]$, the expectation is given a special name. These quantities are often used to summarize the properties of complex distributions. For example, when $f[x] = x$, the resulting expectation $\mathbb{E}[x]$ is termed the mean, $\mu$. It is a measure of the center of a distribution. Similarly, the expected squared deviation from the mean $\mathbb{E}[(x - \mu)^2]$ is termed the variance, $\sigma^2$. This is a measure of the spread of the distribution. The standard deviation $\sigma$ is the positive square root of the variance. It also measures the spread of the distribution but has the merit that it is expressed in the same units as the variable $x$.

As the name suggests, the covariance $\mathbb{E}[(x - \mu_x)(y - \mu_y)]$ of two variables $x$ and $y$ measures the degree to which they co-vary. Here $\mu_x$ and $\mu_y$ represent the means of the variables $x$ and $y$, respectively. The covariance will be large when the variance of both variables is large and when the value of $x$ tends to increase when the value of $y$ increases.

If two variables are independent, then their covariance is zero. However, a covariance of zero does not imply independence. For example, consider a distribution $Pr(x, y)$ where the probability is uniformly distributed on a circle of radius one centered on the origin of the $x, y$ plane. There is no tendency on average for $x$ to increase when $y$ increases or vice-versa. However, knowing the value of $x = 0$ tells us that $y$ has an equal chance of taking the values $\pm 1$, so the variables cannot be independent.
The covariances of multiple random variables stored in a column vector $x \in \mathbb{R}^D$ can be represented by the $D \times D$ covariance matrix $\mathbb{E}[(x - \mu_x)(x - \mu_x)^T]$, where the vector $\mu_x$ contains the means $\mathbb{E}[x]$. The element at position $(i, j)$ of this matrix represents the covariance between variables $x_i$ and $x_j$.
C.2.3 Variance identity
The rules of expectation (appendix C.2.1) can be used to prove the following identity that allows us to write the variance in a different form:

$$\mathbb{E}\bigl[(x - \mu)^2\bigr] = \mathbb{E}\bigl[x^2\bigr] - \mathbb{E}[x]^2. \tag{C.13}$$

Proof:

$$\begin{aligned}
\mathbb{E}\bigl[(x - \mu)^2\bigr] &= \mathbb{E}\bigl[x^2 - 2\mu x + \mu^2\bigr] \\
&= \mathbb{E}\bigl[x^2\bigr] - \mathbb{E}\bigl[2\mu x\bigr] + \mathbb{E}\bigl[\mu^2\bigr] \\
&= \mathbb{E}\bigl[x^2\bigr] - 2\mu \cdot \mathbb{E}[x] + \mu^2 \\
&= \mathbb{E}\bigl[x^2\bigr] - 2\mu^2 + \mu^2 \\
&= \mathbb{E}\bigl[x^2\bigr] - \mu^2 \\
&= \mathbb{E}\bigl[x^2\bigr] - \mathbb{E}[x]^2, \tag{C.14}
\end{aligned}$$

where we have used rule 3 between lines 1 and 2, rules 1 and 2 between lines 2 and 3, and the definition $\mu = \mathbb{E}[x]$ in the remaining two lines.
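The identity can be checked on a small discrete sample, treating the data as a distribution with weight $1/N$ per point (an editorial sketch):

```python
# Check E[(x - mu)^2] = E[x^2] - E[x]^2 on a small dataset.
data = [1.0, 2.0, 4.0, 7.0]
N = len(data)

mu = sum(data) / N
var_direct = sum((x - mu) ** 2 for x in data) / N          # left-hand side
var_identity = sum(x * x for x in data) / N - mu ** 2      # right-hand side

assert abs(var_direct - var_identity) < 1e-12
print(var_direct)  # 5.25 for this dataset
```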
C.2.4 Standardization
Setting the mean of a random variable to zero and the variance to one is known as standardization. This is achieved using the transformation:

$$z = \frac{x - \mu}{\sigma}, \tag{C.15}$$

where $\mu$ is the mean of $x$ and $\sigma$ is the standard deviation.

Proof: The mean of the new distribution over $z$ is given by:

$$\mathbb{E}[z] = \mathbb{E}\left[\frac{x - \mu}{\sigma}\right] = \frac{1}{\sigma}\mathbb{E}\bigl[x - \mu\bigr] = \frac{1}{\sigma}\Bigl(\mathbb{E}[x] - \mathbb{E}[\mu]\Bigr) = \frac{1}{\sigma}(\mu - \mu) = 0, \tag{C.16}$$

where again, we have used the four rules for manipulating expectations. The variance of the new distribution is given by:

$$\begin{aligned}
\mathbb{E}\bigl[(z - \mu_z)^2\bigr] = \mathbb{E}\bigl[(z - \mathbb{E}[z])^2\bigr] = \mathbb{E}\bigl[z^2\bigr] &= \mathbb{E}\left[\left(\frac{x - \mu}{\sigma}\right)^2\right] \\
&= \frac{1}{\sigma^2} \cdot \mathbb{E}\bigl[(x - \mu)^2\bigr] \\
&= \frac{1}{\sigma^2} \cdot \sigma^2 = 1. \tag{C.17}
\end{aligned}$$

By a similar argument, we can take a standardized variable $z$ with mean zero and unit variance and convert it to a variable $x$ with mean $\mu$ and variance $\sigma^2$ using:

$$x = \mu + \sigma z. \tag{C.18}$$

In the multivariate case, we can standardize a variable $x$ with mean $\mu$ and covariance matrix $\Sigma$ using:

$$z = \Sigma^{-1/2}(x - \mu). \tag{C.19}$$

The result will have mean $\mathbb{E}[z] = 0$ and an identity covariance matrix $\mathbb{E}[(z - \mathbb{E}[z])(z - \mathbb{E}[z])^T] = I$. To reverse this process, we use:

$$x = \mu + \Sigma^{1/2} z. \tag{C.20}$$
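A sketch of equations C.15 and C.18 on a small dataset (population statistics, dividing by $N$; the data values are made up):

```python
import math

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
N = len(data)

mu = sum(data) / N
sigma = math.sqrt(sum((x - mu) ** 2 for x in data) / N)

# Equation C.15: the standardized data has mean zero and unit variance.
z = [(x - mu) / sigma for x in data]
assert abs(sum(z) / N) < 1e-12
assert abs(sum(v * v for v in z) / N - 1.0) < 1e-12

# Equation C.18 reverses the transformation exactly.
x_back = [mu + sigma * v for v in z]
assert all(abs(a - b) < 1e-12 for a, b in zip(x_back, data))
```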
C.3 Normal probability distribution
Probability distributions used in this book include the Bernoulli distribution (figure 5.6), categorical distribution (figure 5.9), Poisson distribution (figure 5.15), von Mises distribution (figure 5.13), and mixture of Gaussians (figures 5.14 and 17.1). However, the most common distribution in machine learning is the normal or Gaussian distribution.
C.3.1 Univariate normal distribution
A univariate normal distribution (figure 5.3) over scalar variable $x$ has two parameters, the mean $\mu$ and the variance $\sigma^2$, and is defined as:

$$Pr(x) = \text{Norm}_x[\mu, \sigma^2] = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(x - \mu)^2}{2\sigma^2}\right]. \tag{C.21}$$

Unsurprisingly, the mean $\mathbb{E}[x]$ of a normally distributed variable is given by the mean parameter $\mu$ and the variance $\mathbb{E}[(x - \mathbb{E}[x])^2]$ by the variance parameter $\sigma^2$. When the mean is zero and the variance is one, we refer to this as a standard normal distribution.

The shape of the normal distribution can be inferred from the following argument. The term $-(x - \mu)^2/2\sigma^2$ is a quadratic function that falls away from zero as $x$ moves away from $\mu$, at a rate that increases as $\sigma$ becomes smaller. When we pass this through the exponential function (figure B.1), we get a bell-shaped curve, which has a value of one at $x = \mu$ and falls away to either side. Dividing by the constant $\sqrt{2\pi\sigma^2}$ ensures that the function integrates to one and is a valid distribution. It follows from this argument that the mean $\mu$ controls the position of the center of the bell curve, and the square root $\sigma$ of the variance (the standard deviation) controls the width of the bell curve.
C.3.2 Multivariate normal distribution
The multivariate normal distribution generalizes the normal distribution to describe the probability over a vector quantity $x$ of length $D$. It is defined by a $D \times 1$ mean vector $\mu$ and a symmetric positive definite $D \times D$ covariance matrix $\Sigma$:

$$\text{Norm}_x[\mu, \Sigma] = \frac{1}{(2\pi)^{D/2}\, |\Sigma|^{1/2}} \exp\left[-\frac{(x - \mu)^T \Sigma^{-1} (x - \mu)}{2}\right]. \tag{C.22}$$

The interpretation is similar to the univariate case. The quadratic term $-(x - \mu)^T \Sigma^{-1} (x - \mu)/2$ returns a scalar that decreases as $x$ grows further from the mean $\mu$, at a rate that depends on the matrix $\Sigma$. This is turned into a bell-curve shape by the exponential, and dividing by $(2\pi)^{D/2} |\Sigma|^{1/2}$ ensures that the distribution integrates to one.
The covariance matrix can take spherical, diagonal, and full forms:

$$\Sigma_{\text{spher}} = \begin{bmatrix} \sigma^2 & 0 \\ 0 & \sigma^2 \end{bmatrix} \qquad \Sigma_{\text{diag}} = \begin{bmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{bmatrix} \qquad \Sigma_{\text{full}} = \begin{bmatrix} \sigma_{11}^2 & \sigma_{12}^2 \\ \sigma_{21}^2 & \sigma_{22}^2 \end{bmatrix}. \tag{C.23}$$
Figure C.4 Bivariate normal distribution. a–b) When the covariance matrix is a multiple of the identity matrix, the iso-contours are circles, and we refer to this as spherical covariance. c–d) When the covariance is an arbitrary diagonal matrix, the iso-contours are axis-aligned ellipses, and we refer to this as diagonal covariance. e–f) When the covariance is an arbitrary symmetric positive definite matrix, the iso-contours are general ellipses, and we refer to this as full covariance.

In two dimensions (figure C.4), spherical covariances produce circular iso-density contours, and diagonal covariances produce ellipsoidal iso-contours that are aligned with the coordinate axes. Full covariances produce general ellipsoidal iso-density contours. When the covariance is spherical or diagonal, the individual variables are independent:
$$\begin{aligned}
Pr(x_1, x_2) &= \frac{1}{2\pi\sqrt{|\Sigma|}} \exp\left[-0.5 \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}^T \Sigma^{-1} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}\right] \\
&= \frac{1}{2\pi\sigma_1\sigma_2} \exp\left[-0.5 \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}^T \begin{bmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{bmatrix}^{-1} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}\right] \\
&= \frac{1}{\sqrt{2\pi\sigma_1^2}} \exp\left[-\frac{x_1^2}{2\sigma_1^2}\right] \cdot \frac{1}{\sqrt{2\pi\sigma_2^2}} \exp\left[-\frac{x_2^2}{2\sigma_2^2}\right] \\
&= Pr(x_1) \cdot Pr(x_2). \tag{C.24}
\end{aligned}$$
Figure C.5 Change of variables. a) The conditional distribution $Pr(x|y)$ is a normal distribution with constant variance and a mean that depends linearly on $y$. The cyan distribution shows one example for $y = 0.2$. b) This is proportional to the conditional probability $Pr(y|x)$, which is a normal distribution with constant variance and a mean that depends linearly on $x$. The cyan distribution shows one example for $x = 3$.
C.3.3 Product of two normal distributions
The product of two normal distributions is proportional to a third normal distribution according to the relation:

$$\text{Norm}_x[a, A]\, \text{Norm}_x[b, B] \propto \text{Norm}_x\Bigl[(A^{-1} + B^{-1})^{-1}(A^{-1}a + B^{-1}b),\; (A^{-1} + B^{-1})^{-1}\Bigr].$$

This is easily proved by multiplying out the exponential terms and completing the square (see problem 18.5).
C.3.4 Change of variable
When the mean of a multivariate normal in $x$ is a linear function $Ay + b$ of a second variable $y$, the distribution is proportional to another normal distribution in $y$, where the mean is a linear function of $x$:

$$\text{Norm}_x[Ay + b, \Sigma] \propto \text{Norm}_y\Bigl[(A^T \Sigma^{-1} A)^{-1} A^T \Sigma^{-1}(x - b),\; (A^T \Sigma^{-1} A)^{-1}\Bigr]. \tag{C.25}$$

At first sight, this relation is rather opaque, but figure C.5 shows the case for scalar $x$ and $y$, which is easy to understand. As with the previous relation, this can be proved by expanding the quadratic product in the exponential term and completing the square to make this a distribution in $y$ (see problem 18.4).
C.4 Sampling
To sample from a univariate distribution $Pr(x)$, we first compute the cumulative distribution function $F[x]$ (the integral of $Pr(x)$). Then we draw a sample $z^*$ from a uniform distribution over the range $[0, 1]$ and evaluate this against the inverse of the cumulative distribution, so the sample $x^*$ is created as:

$$x^* = F^{-1}[z^*]. \tag{C.26}$$
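A sketch of inverse-transform sampling for an exponential distribution (our choice of target, not from the main text; its closed-form inverse CDF makes it convenient):

```python
import math
import random

random.seed(1)

# Inverse-transform sampling for an exponential distribution with rate lam.
# CDF: F[x] = 1 - exp(-lam * x), so F^{-1}[z] = -log(1 - z) / lam  (eq. C.26).
lam = 2.0
I = 200_000
samples = [-math.log(1.0 - random.random()) / lam for _ in range(I)]

mean = sum(samples) / I
print(mean)  # close to the true mean 1 / lam = 0.5
```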
C.4.1 Sampling from normal distributions
The method above can be used to generate a sample $x^*$ from a univariate standard normal distribution. A sample from a normal distribution with mean $\mu$ and variance $\sigma^2$ can then be created using equation C.18. Similarly, a sample $x^*$ from a $D$-dimensional multivariate standard normal distribution can be created by independently sampling $D$ univariate standard normal variables. A sample from a multivariate normal distribution with mean $\mu$ and covariance $\Sigma$ can then be created using equation C.20.
C.4.2 Ancestral sampling
When the joint distribution can be factored into a series of conditional probabilities, we can generate samples using ancestral sampling. The idea is to draw a sample from the root variable(s) and then to sample from each subsequent conditional distribution based on the values already drawn. This is easiest to understand with an example. Consider a joint distribution over three variables, x, y, and z, where the distribution factors as:

\[
Pr(x, y, z) = Pr(x)\,Pr(y|x)\,Pr(z|y). \tag{C.27}
\]

To sample from this joint distribution, we first draw a sample x* from Pr(x). Then we draw a sample y* from Pr(y|x*). Finally, we draw a sample z* from Pr(z|y*).
C.5 Distances between probability distributions
Supervised learning can be framed in terms of minimizing the distance between the
probability distribution implied by the model and the discrete probability distribution
implied by the samples (section 5.7). Unsupervised learning can often be framed in
terms of minimizing the distance between the probability distribution of real examples
and the distribution of data from the model. In both cases, we need a measure of distance
between two probability distributions. This section considers the properties of several
dierent measures of distance between distributions (see also gure 15.8 for a discussion
of the Wasserstein or earth mover’s distance).
Draft: please send errata to udlbookmail@gmail.com.
Figure C.6 Lower bound on negative logarithm. The function 1 − y is always less than the function −log[y]. This relation is used to show that the Kullback-Leibler divergence is always greater than or equal to zero.
C.5.1 Kullback-Leibler divergence
The most common measure of distance between probability distributions p(x) and q(x) is the Kullback-Leibler or KL divergence, which is defined as:

\[
D_{KL}\Bigl[p(x)\,\Big\|\,q(x)\Bigr] = \int p(x)\log\!\left[\frac{p(x)}{q(x)}\right]dx. \tag{C.28}
\]
This distance is always greater than or equal to zero, which is easily demonstrated by noting that $-\log[y] \geq 1-y$ (figure C.6), so:

\[
\begin{aligned}
D_{KL}\Bigl[p(x)\,\Big\|\,q(x)\Bigr] &= \int p(x)\log\!\left[\frac{p(x)}{q(x)}\right]dx \\
&= -\int p(x)\log\!\left[\frac{q(x)}{p(x)}\right]dx \\
&\geq \int p(x)\left(1-\frac{q(x)}{p(x)}\right)dx \\
&= \int p(x)-q(x)\,dx \\
&= 1-1 = 0.
\end{aligned} \tag{C.29}
\]
The KL divergence is infinite if there are places where q(x) is zero but p(x) is non-zero.
This can lead to problems when we are minimizing a function based on this distance.
C.5.2 Jensen-Shannon divergence
The KL divergence is not symmetric (i.e., in general D_KL[p(x)||q(x)] ≠ D_KL[q(x)||p(x)]). The Jensen-Shannon divergence is a measure of distance that is symmetric by construction:

\[
D_{JS}\Bigl[p(x)\,\Big\|\,q(x)\Bigr] = \frac{1}{2}D_{KL}\!\left[p(x)\,\middle\|\,\frac{p(x)+q(x)}{2}\right] + \frac{1}{2}D_{KL}\!\left[q(x)\,\middle\|\,\frac{p(x)+q(x)}{2}\right]. \tag{C.30}
\]
It is the mean of the KL divergences from p(x) and q(x) to the average of the two distributions.
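For discrete distributions, the integrals in equations C.28 and C.30 become sums, and both divergences take only a few lines. The sketch below (NumPy; the example distributions are arbitrary) also reproduces the infinite KL divergence that occurs when q(x) is zero where p(x) is non-zero:

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL[p||q] = sum_i p_i log(p_i / q_i) for discrete distributions.

    Terms with p_i = 0 contribute zero; if q_i = 0 where p_i > 0 the
    divergence is infinite, mirroring the continuous case in the text.
    """
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    if np.any(q[mask] == 0):
        return np.inf
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def js_divergence(p, q):
    """Symmetric Jensen-Shannon divergence (equation C.30)."""
    m = 0.5 * (np.asarray(p, float) + np.asarray(q, float))
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

p = [0.5, 0.5, 0.0]
q = [0.1, 0.8, 0.1]
print(kl_divergence(p, q), kl_divergence(q, p))  # asymmetric; second is infinite
print(js_divergence(p, q), js_divergence(q, p))  # symmetric
```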
C.5.3 Fréchet distance
The Fréchet distance D_Fr between two distributions p(x) and q(y) is given by:

\[
D_{Fr}\Bigl[p(x)\,\Big\|\,q(y)\Bigr] = \sqrt{\min_{\pi(x,y)} \iint \pi(x,y)\,|x-y|^{2}\,dx\,dy}, \tag{C.31}
\]

where the minimum is taken over the set of joint distributions π(x, y) that are compatible with the marginal distributions p(x) and q(y). The Fréchet distance can also be formulated as a
measure of the maximum distance between the cumulative probability curves.
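In one dimension, the minimizing coupling π(x, y) pairs the quantiles of the two distributions, so the distance between two equal-sized sample sets can be estimated by sorting both and matching elements. A sketch under that assumption:

```python
import numpy as np

def frechet_1d(x_samples, y_samples):
    """Monte Carlo estimate of the 1D Frechet / 2-Wasserstein distance.

    In one dimension, the optimal coupling pi(x, y) pairs the quantiles
    of the two distributions, so sorting both sample sets and matching
    them element-wise realizes the minimum in equation C.31.
    """
    x = np.sort(np.asarray(x_samples, float))
    y = np.sort(np.asarray(y_samples, float))
    assert len(x) == len(y), "equal sample counts keep the pairing simple"
    return np.sqrt(np.mean((x - y) ** 2))

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 50000)
b = rng.normal(3.0, 1.0, 50000)
d = frechet_1d(a, b)
print(d)  # close to |mu_1 - mu_2| = 3 for equal variances
```

For two normal distributions with equal variance, the distance reduces to the difference of the means, which the estimate recovers.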
C.5.4 Distances between normal distributions
Often we want to compute the distance between two multivariate normal distributions with means µ₁ and µ₂ and covariances Σ₁ and Σ₂. In this case, various measures of distance can be written in closed form.
The KL divergence can be computed as:
\[
D_{KL}\Bigl[\text{Norm}[\boldsymbol{\mu}_{1},\boldsymbol{\Sigma}_{1}]\,\Big\|\,\text{Norm}[\boldsymbol{\mu}_{2},\boldsymbol{\Sigma}_{2}]\Bigr] = \frac{1}{2}\left(\log\!\left[\frac{|\boldsymbol{\Sigma}_{2}|}{|\boldsymbol{\Sigma}_{1}|}\right] - D + \text{tr}\!\left[\boldsymbol{\Sigma}_{2}^{-1}\boldsymbol{\Sigma}_{1}\right] + (\boldsymbol{\mu}_{2}-\boldsymbol{\mu}_{1})^{T}\boldsymbol{\Sigma}_{2}^{-1}(\boldsymbol{\mu}_{2}-\boldsymbol{\mu}_{1})\right), \tag{C.32}
\]
where tr[•] is the trace of the matrix argument. The Fréchet/2-Wasserstein distance is given by:
\[
D^{2}_{Fr/W_{2}}\Bigl[\text{Norm}[\boldsymbol{\mu}_{1},\boldsymbol{\Sigma}_{1}]\,\Big\|\,\text{Norm}[\boldsymbol{\mu}_{2},\boldsymbol{\Sigma}_{2}]\Bigr] = \left|\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2}\right|^{2} + \text{tr}\!\left[\boldsymbol{\Sigma}_{1}+\boldsymbol{\Sigma}_{2}-2\left(\boldsymbol{\Sigma}_{1}\boldsymbol{\Sigma}_{2}\right)^{1/2}\right]. \tag{C.33}
\]
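Both closed-form expressions are easy to evaluate numerically. The sketch below assumes NumPy and evaluates tr[(Σ₁Σ₂)^(1/2)] via the equivalent symmetric form tr[(Σ₂^(1/2)Σ₁Σ₂^(1/2))^(1/2)], which avoids the square root of a non-symmetric matrix:

```python
import numpy as np

def psd_sqrt(M):
    """Square root of a symmetric positive semi-definite matrix."""
    vals, vecs = np.linalg.eigh(M)
    return vecs @ np.diag(np.sqrt(np.clip(vals, 0, None))) @ vecs.T

def kl_normal(mu1, Sigma1, mu2, Sigma2):
    """Closed-form KL divergence between Gaussians (equation C.32)."""
    D = len(mu1)
    Sigma2_inv = np.linalg.inv(Sigma2)
    diff = mu2 - mu1
    return 0.5 * (np.log(np.linalg.det(Sigma2) / np.linalg.det(Sigma1))
                  - D
                  + np.trace(Sigma2_inv @ Sigma1)
                  + diff @ Sigma2_inv @ diff)

def frechet_normal_sq(mu1, Sigma1, mu2, Sigma2):
    """Squared Frechet / 2-Wasserstein distance (equation C.33).

    tr[(Sigma1 Sigma2)^{1/2}] is computed via the equivalent symmetric
    form tr[(Sigma2^{1/2} Sigma1 Sigma2^{1/2})^{1/2}].
    """
    s2 = psd_sqrt(Sigma2)
    cross = psd_sqrt(s2 @ Sigma1 @ s2)
    diff = mu1 - mu2
    return diff @ diff + np.trace(Sigma1 + Sigma2 - 2.0 * cross)

mu = np.zeros(2)
Sigma = np.eye(2)
print(kl_normal(mu, Sigma, mu, Sigma))          # zero for identical distributions
print(frechet_normal_sq(mu, Sigma, mu, Sigma))  # zero for identical distributions
```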
Bibliography
Abdal, R., Qin, Y., & Wonka, P. (2019). Im-
age2StyleGAN: How to embed images into the
StyleGAN latent space? IEEE/CVF Interna-
tional Conference on Computer Vision, 4432–
4441. 301
Abdal, R., Qin, Y., & Wonka, P. (2020). Im-
age2StyleGAN++: How to edit the embedded
images? IEEE/CVF Computer Vision & Pat-
tern Recognition, 8296–8305. 301
Abdal, R., Zhu, P., Mitra, N. J., & Wonka, P.
(2021). StyleFlow: Attribute-conditioned ex-
ploration of StyleGAN-generated images us-
ing conditional continuous normalizing flows.
ACM Transactions on Graphics (ToG), 40(3),
1–21. 300, 322
Abdalla, M., & Abdalla, M. (2021). The grey
hoodie project: Big tobacco, big tech, and
the threat on academic integrity. AAAI/ACM
Conference on AI, Ethics, and Society, 287–
297. 434
Abdel-Hamid, O., Mohamed, A.-r., Jiang, H., &
Penn, G. (2012). Applying convolutional neu-
ral networks concepts to hybrid NN-HMM
model for speech recognition. IEEE Interna-
tional Conference on Acoustics, Speech and
Signal Processing, 4277–4280. 182
Abdelhamed, A., Brubaker, M. A., & Brown, M. S.
(2019). Noise flow: Noise modeling with conditional normalizing flows. IEEE/CVF Interna-
tional Conference on Computer Vision, 3165–
3173. 322
Abeßer, J., Mimilakis, S. I., Gräfe, R., Lukashe-
vich, H., & Fraunhofer, I. (2017). Acoustic
scene classification by combining autoencoder-
based dimensionality reduction and convolu-
tional neural networks. Workshop on Detec-
tion and Classification of Acoustic Scenes and
Events, 7–11. 160
Abrahams, D. (2023). Let’s talk about genera-
tive AI and fraud. Forter Blog, March 27,
2023. https://www.forter.com/blog/lets-
talk-about-generative-ai-and-fraud/. 428
Abu-El-Haija, S., Perozzi, B., Kapoor, A., Alipour-
fard, N., Lerman, K., Harutyunyan, H.,
Ver Steeg, G., & Galstyan, A. (2019). MixHop:
Higher-order graph convolutional architectures
via sparsified neighborhood mixing. Interna-
tional Conference on Machine Learning, 21–
29. 263
Adler, J., & Lunz, S. (2018). Banach Wasserstein
GAN. Neural Information Processing Systems,
31, 6755–6764. 299
Agarwal, R., Schuurmans, D., & Norouzi, M.
(2020). An optimistic perspective on offline
reinforcement learning. International Confer-
ence on Machine Learning, 104–114. 398
Aggarwal, C. C., Hinneburg, A., & Keim, D. A.
(2001). On the surprising behavior of distance
metrics in high dimensional space. Interna-
tional Conference on Database Theory, 420–
434. 135
Agüera y Arcas, B., Todorov, A., & Mitchell,
M. (2018). Do algorithms reveal sexual
orientation or just expose our stereo-
types? Medium, Jan 11, 2018. https://medium.com/@blaisea/do-algorithms-reveal-sexual-orientation-or-just-expose-our-stereotypes-d998fafdf477. 431
Ahmed, N., & Wahed, M. (2016). The de-
democratization of AI: Deep learning and the
compute divide in artificial intelligence re-
search. arXiv:1606.06565. 430
Ahmed, S., Mula, R. S., & Dhavala, S. S.
(2020). A framework for democratizing AI.
arXiv:2001.00818. 430
Ahmed, T. (2017). AI can tell if you’re
gay: Artificial intelligence predicts sexu-
ality from one photo with startling accu-
racy. Newsweek, 8 Sept 2017. https://www.newsweek.com/ai-can-tell-if-youre-gay-artificial-intelligence-predicts-sexuality-one-photo-661643. 430
Aiken, M., & Park, M. (2010). The efficacy
of round-trip translation for MT evaluation.
Translation Journal, 14(1). 160
Ainslie, J., Ontañón, S., Alberti, C., Cvicek, V.,
Fisher, Z., Pham, P., Ravula, A., Sanghai, S.,
Wang, Q., & Yang, L. (2020). ETC: Encod-
ing long and structured inputs in transformers.
ACL Empirical Methods in Natural Language
Processing, 268–284. 237
Akers, J., Bansal, G., Cadamuro, G., Chen, C.,
Chen, Q., Lin, L., Mulcaire, P., Nandaku-
mar, R., Rockett, M., Simko, L., Toman,
J., Wu, T., Zeng, E., Zorn, B., & Roes-
ner, F. (2018). Technology-enabled disinfor-
mation: Summary, lessons, and recommenda-
tions. arXiv:1812.09383. 427
Akuzawa, K., Iwasawa, Y., & Matsuo, Y. (2018).
Expressive speech synthesis via modeling expressions with variational autoencoder. INTERSPEECH, 3067–3071. 343
Ali, A., Touvron, H., Caron, M., Bojanowski, P.,
Douze, M., Joulin, A., Laptev, I., Neverova,
N., Synnaeve, G., Verbeek, J., et al. (2021).
XCiT: Cross-covariance image transformers.
Neural Information Processing Systems, 34,
20014–20027. 238
Allen, C., Smit, I., & Wallach, W. (2005). Artificial
morality: Top-down, bottom-up, and hybrid
approaches. Ethics and Information Technol-
ogy, 7, 149–155. 424
Allen-Zhu, Z., Li, Y., & Song, Z. (2019). A con-
vergence theory for deep learning via over-
parameterization. International Conference
on Machine Learning, 97, 242–252. 404
Alon, U., & Yahav, E. (2021). On the bottleneck of
graph neural networks and its practical impli-
cations. International Conference on Learning
Representations. 265
Alvarez, J. M., & Salzmann, M. (2016). Learning
the number of neurons in deep networks. Neu-
ral Information Processing Systems, 29, 2262–
2270. 414
Amari, S.-I. (1998). Natural gradient works efficiently in learning. Neural Computation,
10(2), 251–276. 397
Amodei, D., Olah, C., Steinhardt, J., Christiano,
P., Schulman, J., & Mané, D. (2016). Concrete
problems in AI safety. arXiv:1606.06565. 421
An, G. (1996). The effects of adding noise during
backpropagation training on a generalization
performance. Neural Computation, 8(3), 643–
674. 158
An, J., Huang, S., Song, Y., Dou, D., Liu, W., &
Luo, J. (2021). ArtFlow: Unbiased image style
transfer via reversible neural flows. IEEE/CVF
Computer Vision & Pattern Recognition, 862–
871. 322
Anderson, M., & Anderson, S. L. (2008). Ethical
healthcare agents. Advanced Computational
Intelligence Paradigms in Healthcare 3. Stud-
ies in Computational Intelligence, vol. 107,
233–257. 424
Andreae, J. (1969). Learning machines: A unified
view. Encyclopaedia of Linguistics, Informa-
tion and Control, 261–270. 396
Angwin, J., Larson, J., Mattu, S., & Kirchner, L.
(2016). Machine bias: There’s software used
across the country to predict future criminals.
and it’s biased against blacks. ProPublica,
May 23, 2016. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing. 420
Ardizzone, L., Kruse, J., Lüth, C., Bracher, N.,
Rother, C., & Köthe, U. (2020). Conditional
invertible neural networks for diverse image-to-
image translation. DAGM German Conference
on Pattern Recognition, 373–387. 322
Arjovsky, M., & Bottou, L. (2017). Towards prin-
cipled methods for training generative adver-
sarial networks. International Conference on
Learning Representations. 283, 299
Arjovsky, M., Chintala, S., & Bottou, L. (2017).
Wasserstein generative adversarial networks.
International Conference on Machine Learn-
ing, 214–223. 280, 299
Arkin, R. C. (2008a). Governing lethal behav-
ior: Embedding ethics in a hybrid delibera-
tive/reactive robot architecture—Part I: Mo-
tivation and philosophy. ACM/IEEE Interna-
tional Conference on Human Robot Interac-
tion, 121–128. 424
Arkin, R. C. (2008b). Governing lethal behav-
ior: Embedding ethics in a hybrid delibera-
tive/reactive robot architecture—Part II: For-
malization for ethical control. Conference on
Articial General Intelligence, 51–62. 424
Arnab, A., Dehghani, M., Heigold, G., Sun, C.,
Lučić, M., & Schmid, C. (2021). ViViT: A
video vision transformer. IEEE/CVF Interna-
tional Conference on Computer Vision, 6836–
6846. 238
Arora, R., Basu, A., Mianjy, P., & Mukherjee, A.
(2016). Understanding deep neural networks
with rectified linear units. arXiv:1611.01491.
52
Arora, S., Ge, R., Liang, Y., Ma, T., & Zhang,
Y. (2017). Generalization and equilibrium in
generative adversarial nets (GANs). Interna-
tional Conference on Machine Learning, 224–
232. 300
Arora, S., Li, Z., & Lyu, K. (2018). Theoretical
analysis of auto rate-tuning by batch normal-
ization. arXiv:1812.03981. 204
Arora, S., & Zhang, Y. (2017). Do GANs actually
learn the distribution? An empirical study.
arXiv:1706.08224. 300
Arulkumaran, K., Deisenroth, M. P., Brundage,
M., & Bharath, A. A. (2017). Deep reinforce-
ment learning: A brief survey. IEEE Signal
Processing Magazine, 34(6), 26–38. 396
Asaro, P. (2012). On banning autonomous weapon
systems: human rights, automation, and the
dehumanization of lethal decision-making. In-
ternational Review of the Red Cross, 94(886),
687–709. 429
Atwood, J., & Towsley, D. (2016). Diusion-
convolutional neural networks. Neural Infor-
mation Processing Systems, 29, 1993–2001.
262
Aubret, A., Matignon, L., & Hassas, S. (2019). A
survey on intrinsic motivation in reinforcement
learning. arXiv:1908.06976. 398
Austin, J., Johnson, D. D., Ho, J., Tarlow, D., &
van den Berg, R. (2021). Structured denois-
ing diusion models in discrete state-spaces.
Neural Information Processing Systems, 34,
17981–17993. 369
Awad, E., Dsouza, S., Kim, R., Schulz, J., Henrich,
J., Shari, A., Bonnefon, J.-F., & Rahwan, I.
(2018). The moral machine experiment. Na-
ture, 563, 59–64. 424
Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016).
Layer normalization. arXiv:1607.06450. 203
Bachlechner, T., Majumder, B. P., Mao, H., Cot-
trell, G., & McAuley, J. (2021). ReZero is all
you need: Fast convergence at large depth. Un-
certainty in Artificial Intelligence, 1352–1361.
238
Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neu-
ral machine translation by jointly learning to
align and translate. International Conference
on Learning Representations. 233, 235
Bahri, Y., Kadmon, J., Pennington, J., Schoen-
holz, S. S., Sohl-Dickstein, J., & Ganguli, S.
(2020). Statistical mechanics of deep learning.
Annual Review of Condensed Matter Physics,
11, 501–528. 409, 410
Baldi, P., & Hornik, K. (1989). Neural networks
and principal component analysis: Learning
from examples without local minima. Neural
networks, 2(1), 53–58. 410
Balduzzi, D., Frean, M., Leary, L., Lewis, J., Ma,
K. W.-D., & McWilliams, B. (2017). The shat-
tered gradients problem: If ResNets are the
answer, then what is the question? Interna-
tional Conference on Machine Learning, 342–
350. 188, 202, 203, 205
Bansal, A., Borgnia, E., Chu, H.-M., Li, J. S.,
Kazemi, H., Huang, F., Goldblum, M., Geip-
ing, J., & Goldstein, T. (2022). Cold diusion:
Inverting arbitrary image transforms without
noise. arXiv:2208.09392. 369
Bao, F., Li, C., Zhu, J., & Zhang, B. (2022).
Analytic-DPM: An analytic estimate of the op-
timal reverse variance in diusion probabilistic
models. International Conference on Learning
Representations. 369
Baranchuk, D., Rubachev, I., Voynov, A.,
Khrulkov, V., & Babenko, A. (2022). Label-
ecient semantic segmentation with diusion
models. International Conference on Learning
Representations. 369
Barber, D., & Bishop, C. (1997). Ensemble learn-
ing for multi-layer networks. Neural Informa-
tion Processing Systems, 10, 395–401. 159
Barocas, S., Hardt, M., & Narayanan, A. (2023).
Fairness and Machine Learning: Limitations
and Opportunities. MIT Press. 423
Barratt, S., & Sharma, R. (2018). A note on the in-
ception score. Workshop on Theoretical Foun-
dations and Applications of Deep Generative
Models. 274
Barrett, D. G. T., & Dherin, B. (2021). Implicit
gradient regularization. International Confer-
ence on Learning Representations. 157
Barrett, L. (2020). Ban facial recognition tech-
nologies for children and for everyone else.
Boston University Journal of Science and
Technology Law, 26(2), 223–285. 427
Barron, J. T. (2019). A general and adaptive ro-
bust loss function. IEEE/CVF Computer Vi-
sion & Pattern Recognition, 4331–4339. 73
Bartlett, P. L., Foster, D. J., & Telgarsky, M. J.
(2017). Spectrally-normalized margin bounds
for neural networks. Neural Information Pro-
cessing Systems, vol. 30, 6240–6249. 156
Bartlett, P. L., Harvey, N., Liaw, C., & Mehrabian,
A. (2019). Nearly-tight VC-dimension and
pseudodimension bounds for piecewise linear
neural networks. Journal of Machine Learn-
ing Research, 20(1), 2285–2301. 134
Barto, A. G. (2013). Intrinsic motivation and re-
inforcement learning. Intrinsically Motivated
Learning in Natural and Artificial Systems,
17–47. 398
Bau, D., Zhou, B., Khosla, A., Oliva, A., & Tor-
ralba, A. (2017). Network dissection: Quanti-
fying interpretability of deep visual representa-
tions. IEEE/CVF Computer Vision & Pattern
Recognition, 6541–6549. 184
Bau, D., Zhu, J.-Y., Wulff, J., Peebles, W., Stro-
belt, H., Zhou, B., & Torralba, A. (2019). See-
ing what a GAN cannot generate. IEEE/CVF
International Conference on Computer Vi-
sion, 4502–4511. 300
Baydin, A. G., Pearlmutter, B. A., Radul, A. A.,
& Siskind, J. M. (2018). Automatic differentiation in machine learning: A survey. Journal of Machine Learning Research, 18, 1–43. 113
Bayer, M., Kaufhold, M.-A., & Reuter, C. (2022).
A survey on data augmentation for text classification. ACM Computing Surveys, 55(7), 1–
39. 160
Behrmann, J., Grathwohl, W., Chen, R. T., Duve-
naud, D., & Jacobsen, J.-H. (2019). Invertible
residual networks. International Conference
on Machine Learning, 573–582. 318, 323
Belinkov, Y., & Bisk, Y. (2018). Synthetic and
natural noise both break neural machine trans-
lation. International Conference on Learning
Representations. 160
Belkin, M., Hsu, D., Ma, S., & Mandal, S. (2019).
Reconciling modern machine-learning practice
and the classical bias–variance trade-off. Pro-
ceedings of the National Academy of Sciences,
116(32), 15849–15854. 130, 134
Bellemare, M. G., Dabney, W., & Munos, R.
(2017a). A distributional perspective on rein-
forcement learning. International Conference
on Machine Learning, 449–458. 397
Bellemare, M. G., Danihelka, I., Dabney, W., Mo-
hamed, S., Lakshminarayanan, B., Hoyer, S.,
& Munos, R. (2017b). The Cramer distance
as a solution to biased Wasserstein gradients.
arXiv:1705.10743. 299
Bellman, R. (1966). Dynamic programming. Sci-
ence, 153(3731), 34–37. 396
Beltagy, I., Peters, M. E., & Cohan, A. (2020).
Longformer: The long-document transformer.
arXiv:2004.05150. 237
Bender, E. M., & Koller, A. (2020). Climbing
towards NLU: On meaning, form, and un-
derstanding in the age of data. Meeting of
the Association for Computational Linguistics,
5185–5198. 234
Bengio, Y., Ducharme, R., & Vincent, P. (2000).
A neural probabilistic language model. Neural
Information Processing Systems, 13, 932–938.
274
Benjamin, R. (2019). Race After Technology: Abo-
litionist Tools for the New Jim Code. Polity.
433
Berard, H., Gidel, G., Almahairi, A., Vincent, P.,
& Lacoste-Julien, S. (2019). A closer look at
the optimization landscapes of generative ad-
versarial networks. arXiv:1906.04848. 299
Berger, P. (2019). MTA’s initial foray into facial
recognition at high speed is a bust. April 07,
2019. https://www.wsj.com/articles/mtas-initial-foray-into-facial-recognition-at-high-speed-is-a-bust-11554642000. 427
Bergstra, J., & Bengio, Y. (2012). Random search
for hyper-parameter optimization. Journal of
Machine Learning Research, 13(10), 281–305.
136
Bergstra, J. S., Bardenet, R., Bengio, Y., & Kégl,
B. (2011). Algorithms for hyper-parameter
optimization. Neural Information Processing
Systems, vol. 24, 2546–2554. 136
Berk, R., Heidari, H., Jabbari, S., Kearns, M., &
Roth, A. (2017). Fairness in criminal justice
risk assessments: the state of the art. Socio-
logical Methods & Research, 50(1), 3–44. 422
Berner, C., Brockman, G., Chan, B., Cheung, V.,
Dębiak, P., Dennison, C., Farhi, D., Fischer,
Q., Hashme, S., Hesse, C., et al. (2019). DOTA
2 with large scale deep reinforcement learning.
arXiv:1912.06680. 396
Bertasius, G., Wang, H., & Torresani, L. (2021).
Is space-time attention all you need for video
understanding? International Conference on
Machine Learning, 3, 813–824. 238
Beyer, K., Goldstein, J., Ramakrishnan, R., &
Shaft, U. (1999). When is “nearest neigh-
bor” meaningful? International Conference
on Database Theory, 217–235. 135
Binns, R. (2018). Algorithmic accountability and
public reason. Philosophy & Technology, 31(4),
543–556. 13
Birhane, A., Isaac, W., Prabhakaran, V., Diaz,
M., Elish, M. C., Gabriel, I., & Mohamed, S.
(2022a). Power to the people? Opportunities
and challenges for participatory AI. Equity and
Access in Algorithms, Mechanisms, and Opti-
mization. 433
Birhane, A., Kalluri, P., Card, D., Agnew, W.,
Dotan, R., & Bao, M. (2022b). The values
encoded in machine learning research. ACM
Conference on Fairness, Accountability, and
Transparency, 173–184. 431
Bishop, C. (1995). Regularization and complex-
ity control in feed-forward networks. Inter-
national Conference on Artificial Neural Net-
works, 141–148. 157, 158
Bishop, C. M. (1994). Mixture density networks.
Aston University Technical Report. 73
Bishop, C. M. (2006). Pattern recognition and ma-
chine learning. Springer. 15, 159
Bjorck, N., Gomes, C. P., Selman, B., & Wein-
berger, K. Q. (2018). Understanding batch
normalization. Neural Information Processing
Systems, 31, 7705–7716. 204
Blum, A. L., & Rivest, R. L. (1992). Training a
3-node neural network is NP-complete. Neural
Networks, 5(1), 117–127. 401
Blundell, C., Cornebise, J., Kavukcuoglu, K., &
Wierstra, D. (2015). Weight uncertainty in
neural network. International Conference on
Machine Learning, 1613–1622. 159
Bond-Taylor, S., Leach, A., Long, Y., & Willcocks,
C. G. (2022). Deep generative modelling: A
comparative review of VAEs, GANs, normal-
izing flows, energy-based and autoregressive
models. IEEE Transactions on Pattern Analy-
sis & Machine Intelligence, 44(11), 7327–7347.
274
Bontridder, N., & Poullet, Y. (2021). The role of
artificial intelligence in disinformation. Data &
Policy, 3, E32. 427
Borji, A. (2022). Pros and cons of GAN evalua-
tion measures: New developments. Computer
Vision & Image Understanding, 215, 103329.
274
Bornschein, J., Shabanian, S., Fischer, A., & Ben-
gio, Y. (2016). Bidirectional Helmholtz ma-
chines. International Conference on Machine
Learning, 2511–2519. 346
Boscaini, D., Masci, J., Rodolà, E., & Bron-
stein, M. (2016). Learning shape correspon-
dence with anisotropic convolutional neural
networks. Neural Information Processing Sys-
tems, 29, 3189–3197. 265
Bottou, L. (2012). Stochastic gradient descent
tricks. Neural Networks: Tricks of the Trade:
Second Edition, 421–436. 91
Bottou, L., Curtis, F. E., & Nocedal, J. (2018).
Optimization methods for large-scale machine
learning. SIAM Review, 60(2), 223–311. 91
Bottou, L., Soulié, F. F., Blanchet, P., & Lié-
nard, J.-S. (1990). Speaker-independent iso-
lated digit recognition: Multilayer perceptrons
vs. dynamic time warping. Neural Networks,
3(4), 453–465. 181
Boulemtafes, A., Derhab, A., & Challal, Y. (2020).
A review of privacy-preserving techniques for
deep learning. Neurocomputing, 384, 21–45.
428
Bousselham, W., Thibault, G., Pagano, L.,
Machireddy, A., Gray, J., Chang, Y. H., &
Song, X. (2021). Efficient self-ensemble
framework for semantic segmentation.
arXiv:2111.13280. 162
Bowman, S. R., & Dahl, G. E. (2021). What will it
take to fix benchmarking in natural language
understanding? ACL Human Language Tech-
nologies, 4843–4855. 234
Bowman, S. R., Vilnis, L., Vinyals, O., Dai, A. M.,
Jozefowicz, R., & Bengio, S. (2015). Generat-
ing sentences from a continuous space. ACL
Conference on Computational Natural Lan-
guage Learning, 10–21. 343, 344, 345
Braverman, H. (1974). Labor and monopoly cap-
ital: the degradation of work in the twentieth
century. Monthly Review Press. 429
Brock, A., Donahue, J., & Simonyan, K. (2019).
Large scale GAN training for high fidelity nat-
ural image synthesis. International Conference
on Learning Representations. 287, 299
Brock, A., Lim, T., Ritchie, J. M., & Weston, N.
(2016). Neural photo editing with introspec-
tive adversarial networks. International Con-
ference on Learning Representations. 345
Bromley, J., Guyon, I., LeCun, Y., Säckinger, E., &
Shah, R. (1993). Signature verification using a
“Siamese” time delay neural network. Neural
Information Processing Systems, 6, 737–744.
181
Bronstein, M. M., Bruna, J., Cohen, T., &
Veličković, P. (2021). Geometric deep learning:
Grids, groups, graphs, geodesics, and gauges.
arXiv:2104.13478. 262
Broussard, M. (2018). Artificial Unintelligence:
How Computers Misunderstand the World.
The MIT Press. 433
Broussard, M. (2023). More than a Glitch: Con-
fronting Race, Gender, and Ability Bias in
Tech. The MIT Press. 433
Brown, T., Mann, B., Ryder, N., Subbiah, M.,
Kaplan, J. D., Dhariwal, P., Neelakantan, A.,
Shyam, P., Sastry, G., Askell, A., et al. (2020).
Language models are few-shot learners. Neu-
ral Information Processing Systems, 33, 1877–
1901. 9, 159, 234, 237, 422, 425
Brügger, R., Baumgartner, C. F., & Konukoglu,
E. (2019). A partially reversible U-Net for
memory-efficient volumetric image segmenta-
tion. International Conference on Medical Im-
age Computing and Computer-Assisted Inter-
vention, 429–437. 322
Bruna, J., Zaremba, W., Szlam, A., & LeCun,
Y. (2013). Spectral networks and locally con-
nected networks on graphs. International Con-
ference on Learning Representations. 262
Brynjolfsson, E., & McAfee, A. (2016). The Second
Machine Age: Work, Progress, and Prosperity
in a Time of Brilliant Technologies. W. W.
Norton. 430
Bryson, A., Ho, Y.-C., & Siouris, G. (1979). Ap-
plied optimal control: Optimization, estima-
tion, and control. IEEE Transactions on Sys-
tems, Man & Cybernetics, 9, 366–367. 113
Bubeck, S., & Sellke, M. (2021). A universal law
of robustness via isoperimetry. Neural Infor-
mation Processing Systems, 34, 28811–28822.
135, 416
Buciluǎ, C., Caruana, R., & Niculescu-Mizil, A.
(2006). Model compression. ACM SIGKDD
International Conference on Knowledge Dis-
covery and Data Mining, 535–541. 415
Bughin, J., Seong, J., Manyika, J., Chui, M., &
Joshi, R. (2018). Notes from the AI Frontier:
Modelling the Impact of AI on the World Econ-
omy. McKinsey Global Institute, Sept 4, 2018.
429
Buolamwini, J., & Gebru, T. (2018). Gender
shades: Intersectional accuracy disparities in
commercial gender classification. Proceedings
of Machine Learning Research, 81. 423
Burda, Y., Grosse, R. B., & Salakhutdinov, R.
(2016). Importance weighted autoencoders. In-
ternational Conference on Learning Represen-
tations. 73, 346
Buschjäger, S., & Morik, K. (2021). There
is no double-descent in random forests.
arXiv:2111.04409. 134
Cai, T., Luo, S., Xu, K., He, D., Liu, T.-y., &
Wang, L. (2021). GraphNorm: A principled
approach to accelerating graph neural network
training. International Conference on Machine
Learning, 1204–1215. 265
Calimeri, F., Marzullo, A., Stamile, C., & Ter-
racina, G. (2017). Biomedical data augmenta-
tion using adversarial neural networks. Inter-
national Conference on Artificial Neural Net-
works, 626–634. 159
Calo, R. (2018). Artificial intelligence policy: A
primer and roadmap. University of Bologna
Law Review, 3(2), 180–218. 430
Cao, H., Tan, C., Gao, Z., Chen, G., Heng, P.-
A., & Li, S. Z. (2022). A survey on generative
diusion model. arXiv:2209.02646. 369
Cao, Z., Qin, T., Liu, T.-Y., Tsai, M.-F., & Li,
H. (2007). Learning to rank: From pairwise
approach to listwise approach. International
Conference on Machine Learning, 129–136. 73
Carion, N., Massa, F., Synnaeve, G., Usunier, N.,
Kirillov, A., & Zagoruyko, S. (2020). End-to-
end object detection with transformers. Eu-
ropean Conference on Computer Vision, 213–
229. 238
Carlini, N., Hayes, J., Nasr, M., Jagielski, M., Se-
hwag, V., Tramèr, F., Balle, B., Ippolito, D.,
& Wallace, E. (2023). Extracting training data
from diusion models. arXiv:2301.13188. 428
Carlini, N., Ippolito, D., Jagielski, M., Lee, K.,
Tramer, F., , & Zhang, C. (2022). Quantifying
memorization across neural language models.
arXiv:2202.07646. 428
Cauchy, A. (1847). Méthode générale pour la résolution des systèmes d'équations simultanées.
Comptes Rendus de l’Académie des Sciences,
25. 91
Cervantes, J.-A., López, S., Rodríguez, L.-F., Cer-
vantes, S., Cervantes, F., & Ramos, F. (2019).
Artificial moral agents: A survey of the cur-
rent status. Science and Engineering Ethics,
26, 501–532. 424
Ceylan, G., Anderson, I. A., & Wood, W. (2023).
Sharing of misinformation is habitual, not just
lazy or biased. Proceedings of the National
Academy of Sciences of the United States of
America, 120(4). 432
Chami, I., Abu-El-Haija, S., Perozzi, B., Ré, C.,
& Murphy, K. (2020). Machine learning on
graphs: A model and comprehensive taxon-
omy. arXiv:2005.03675. 261
Chang, B., Chen, M., Haber, E., & Chi, E. H.
(2019a). AntisymmetricRNN: A dynamical
system view on recurrent neural networks. In-
ternational Conference on Learning Represen-
tations. 323
Chang, B., Meng, L., Haber, E., Ruthotto, L.,
Begert, D., & Holtham, E. (2018). Reversible
architectures for arbitrarily deep residual neu-
ral networks. AAAI Conference on Articial
Intelligence, 2811–2818. 323
Chang, Y.-L., Liu, Z. Y., Lee, K.-Y., & Hsu,
W. (2019b). Free-form video inpainting with
3D gated convolution and temporal Patch-
GAN. IEEE/CVF International Conference
on Computer Vision, 9066–9075. 181
Chaudhari, P., Choromanska, A., Soatto, S., Le-
Cun, Y., Baldassi, C., Borgs, C., Chayes, J.,
Sagun, L., & Zecchina, R. (2019). Entropy-
SGD: Biasing gradient descent into wide val-
leys. Journal of Statistical Mechanics: Theory
and Experiment, 12, 124018. 158, 411
Chen, D., Mei, J.-P., Zhang, Y., Wang, C., Wang,
Z., Feng, Y., & Chen, C. (2021a). Cross-layer
distillation with semantic calibration. AAAI
Conference on Articial Intelligence, 7028–
7036. 416
Chen, H., Wang, Y., Guo, T., Xu, C., Deng, Y.,
Liu, Z., Ma, S., Xu, C., Xu, C., & Gao, W.
(2021b). Pre-trained image processing trans-
former. IEEE/CVF Computer Vision & Pat-
tern Recognition, 12299–12310. 238
Chen, J., Ma, T., & Xiao, C. (2018a). FastGCN:
Fast learning with graph convolutional net-
works via importance sampling. International
Conference on Learning Representations. 264,
265
Chen, J., Zhu, J., & Song, L. (2018b). Stochastic
training of graph convolutional networks with
variance reduction. International Conference
on Machine Learning, 941–949. 264
Chen, L., Lu, K., Rajeswaran, A., Lee, K., Grover,
A., Laskin, M., Abbeel, P., Srinivas, A., &
Mordatch, I. (2021c). Decision transformer:
Reinforcement learning via sequence modeling.
Neural Information Processing Systems, 34,
15084–15097. 398
Chen, L.-C., Papandreou, G., Kokkinos, I., Mur-
phy, K., & Yuille, A. L. (2018c). DeepLab:
Semantic image segmentation with deep con-
volutional nets, atrous convolution, and fully
connected CRFs. IEEE Transactions on Pat-
tern Analysis & Machine Intelligence, 40(4),
834–848. 181
Chen, M., Radford, A., Child, R., Wu, J., Jun, H.,
Luan, D., & Sutskever, I. (2020a). Generative
pretraining from pixels. International Confer-
ence on Machine Learning, 1691–1703. 238
Chen, M., Wei, Z., Huang, Z., Ding, B., & Li, Y.
(2020b). Simple and deep graph convolutional
networks. International Conference on Ma-
chine Learning, 1725–1735. 266
Chen, N., Zhang, Y., Zen, H., Weiss, R. J.,
Norouzi, M., Dehak, N., & Chan, W. (2021d).
WaveGrad 2: Iterative refinement for text-
to-speech synthesis. INTERSPEECH, 3765–
3769. 369
Chen, R. T., Behrmann, J., Duvenaud, D. K., &
Jacobsen, J.-H. (2019). Residual flows for in-
vertible generative modeling. Neural Informa-
tion Processing Systems, 32, 9913–9923. 324
Chen, R. T., Li, X., Grosse, R. B., & Duvenaud,
D. K. (2018d). Isolating sources of disentangle-
ment in variational autoencoders. Neural In-
formation Processing Systems, 31, 2615–2625.
343, 346
Chen, R. T., Rubanova, Y., Bettencourt, J., & Du-
venaud, D. K. (2018e). Neural ordinary differ-
ential equations. Neural Information Process-
ing Systems, 31, 6572–6583. 324
Chen, T., Fox, E., & Guestrin, C. (2014). Stochas-
tic gradient Hamiltonian Monte Carlo. In-
ternational Conference on Machine Learning,
1683–1691. 159
Chen, T., Kornblith, S., Norouzi, M., & Hinton, G.
(2020c). A simple framework for contrastive
learning of visual representations. Interna-
tional Conference on Machine Learning, 1597–
1607. 159
Chen, T., Xu, B., Zhang, C., & Guestrin, C.
(2016a). Training deep nets with sublinear
memory cost. arXiv:1604.06174. 114
Chen, W., Liu, T.-Y., Lan, Y., Ma, Z.-M., & Li,
H. (2009). Ranking measures and loss func-
tions in learning to rank. Neural Information
Processing Systems, 22, 315–323. 73
Chen, X., Duan, Y., Houthooft, R., Schulman,
J., Sutskever, I., & Abbeel, P. (2016b). Info-
GAN: Interpretable representation learning by
information maximizing generative adversarial
nets. Neural Information Processing Systems,
29, 2172–2180. 291, 301
Chen, X., Kingma, D. P., Salimans, T., Duan, Y.,
Dhariwal, P., Schulman, J., Sutskever, I., &
Abbeel, P. (2017). Variational lossy autoen-
coder. International Conference on Learning
Representations. 345
Chen, Y.-C., Li, L., Yu, L., El Kholy, A., Ahmed,
F., Gan, Z., Cheng, Y., & Liu, J. (2020d).
UNITER: Universal image-text representation
learning. European Conference on Computer
Vision, 104–120. 238
Chiang, W.-L., Liu, X., Si, S., Li, Y., Bengio, S.,
& Hsieh, C.-J. (2019). Cluster-GCN: An effi-
cient algorithm for training deep and large
This work is subject to a Creative Commons CC-BY-NC-ND license. (C) MIT Press.
Bibliography 469
graph convolutional networks. ACM SIGKDD
International Conference on Knowledge Dis-
covery & Data Mining, 257–266. 263, 264, 265
Child, R., Gray, S., Radford, A., & Sutskever, I.
(2019). Generating long sequences with sparse
transformers. arXiv:1904.10509. 237
Chintala, S., Denton, E., Arjovsky, M., & Mathieu,
M. (2020). How to train a GAN? Tips and
tricks to make GANs work. https://github.com/soumith/ganhacks. 299
Cho, K., van Merrienboer, B., Bahdanau, D., &
Bengio, Y. (2014). On the properties of neu-
ral machine translation: Encoder-decoder ap-
proaches. ACL Workshop on Syntax, Seman-
tics and Structure in Statistical Translation,
103–111. 233
Choi, D., Shallue, C. J., Nado, Z., Lee, J., Maddi-
son, C. J., & Dahl, G. E. (2019). On empir-
ical comparisons of optimizers for deep learn-
ing. arXiv:1910.05446. 94, 410
Choi, J., Kim, S., Jeong, Y., Gwon, Y., & Yoon, S.
(2021). ILVR: Conditioning method for denois-
ing diusion probabilistic models. IEEE/CVF
International Conference on Computer Vi-
sion, 14347–14356. 370
Choi, J., Lee, J., Shin, C., Kim, S., Kim, H., &
Yoon, S. (2022). Perception prioritized train-
ing of diusion models. IEEE/CVF Computer
Vision & Pattern Recognition, 11472–11481.
369
Choi, Y., Choi, M., Kim, M., Ha, J.-W., Kim, S., &
Choo, J. (2018). StarGAN: Unied generative
adversarial networks for multi-domain image-
to-image translation. IEEE/CVF Computer
Vision & Pattern Recognition, 8789–8797. 301
Chollet, F. (2017). Xception: Deep learning with
depthwise separable convolutions. IEEE/CVF
Computer Vision & Pattern Recognition,
1251–1258. 405
Choromanska, A., Hena, M., Mathieu, M., Arous,
G. B., & LeCun, Y. (2015). The loss surfaces of
multilayer networks. International Conference
on Articial Intelligence and Statistics. 405
Choromanski, K., Likhosherstov, V., Dohan, D.,
Song, X., Gane, A., Sarlos, T., Hawkins, P.,
Davis, J., Mohiuddin, A., Kaiser, L., et al.
(2020). Rethinking attention with Performers.
International Conference on Learning Repre-
sentations. 236, 237
Chorowski, J., & Jaitly, N. (2017). Towards better
decoding and language model integration in se-
quence to sequence models. INTERSPEECH,
523–527. 158
Chouldechova, A. (2017). Fair prediction with dis-
parate impact: A study of bias in recidivism
prediction instruments. Big Data, 5(2), 153–
163. 422
Chowdhery, A., Narang, S., Devlin, J., Bosma, M.,
Mishra, G., Roberts, A., Barham, P., Chung,
H. W., Sutton, C., Gehrmann, S., et al. (2022).
PaLM: Scaling language modeling with path-
ways. arXiv:2204.02311. 234
Christian, B. (2020). The Alignment Problem: Ma-
chine Learning and Human Values. W. W.
Norton. 421
Christiano, P., Shlegeris, B., & Amodei, D. (2018).
Supervising strong learners by amplifying weak
experts. arXiv:1810.08575. 398
Chu, X., Tian, Z., Wang, Y., Zhang, B., Ren, H.,
Wei, X., Xia, H., & Shen, C. (2021). Twins:
Revisiting the design of spatial attention in
vision transformers. Neural Information Pro-
cessing Systems, 34, 9355–9366. 238
Chung, H., Sim, B., & Ye, J. C. (2022). Come-
closer-diuse-faster: Accelerating conditional
diusion models for inverse problems through
stochastic contraction. IEEE/CVF Computer
Vision & Pattern Recognition, 12413–12422.
369
Chung, H., & Ye, J. C. (2022). Score-based dif-
fusion models for accelerated MRI. Medical
Image Analysis, 80, 102479. 369
Chung, J., Gulcehre, C., Cho, K., & Bengio, Y.
(2014). Empirical evaluation of gated recurrent
neural networks on sequence modeling. Deep
Learning and Representation Workshop. 233
Chung, J., Kastner, K., Dinh, L., Goel, K.,
Courville, A. C., & Bengio, Y. (2015). A recur-
rent latent variable model for sequential data.
Neural Information Processing Systems, 28,
2980–2988. 344, 345
Çiçek, Ö., Abdulkadir, A., Lienkamp, S. S., Brox,
T., & Ronneberger, O. (2016). 3D U-Net:
Learning dense volumetric segmentation from
sparse annotation. International Conference
on Medical Image Computing and Computer-
Assisted Intervention, 424–432. 205
Clark, M. (2022). The engineer who
claimed a Google AI is sentient has
been red. The Verge, July 22, 2022.
https://www.theverge.com/2022/7/22/23274958/google-ai-engineer-blake-lemoine-chatbot-lamda-2-sentience. 234
Clevert, D.-A., Unterthiner, T., & Hochreiter,
S. (2015). Fast and accurate deep network
learning by exponential linear units (ELUs).
arXiv:1511.07289. 38
Draft: please send errata to udlbookmail@gmail.com.
Cohen, J. M., Kaur, S., Li, Y., Kolter, J. Z., & Tal-
walkar, A. (2021). Gradient descent on neural
networks typically occurs at the edge of sta-
bility. International Conference on Learning
Representations. 157
Cohen, N., Sharir, O., & Shashua, A. (2016). On
the expressive power of deep learning: A ten-
sor analysis. PMLR Conference on Learning
Theory, 698–728. 53
Cohen, T., & Welling, M. (2016). Group equiv-
ariant convolutional networks. International
Conference on Machine Learning, 2990–2999.
183
Collins, E., Bala, R., Price, B., & Susstrunk, S.
(2020). Editing in style: Uncovering the lo-
cal semantics of GANs. IEEE/CVF Computer
Vision & Pattern Recognition, 5771–5780. 300
Conneau, A., Schwenk, H., Barrault, L., & Lecun,
Y. (2017). Very deep convolutional networks
for text classication. Meeting of the Asso-
ciation for Computational Linguistics, 1107–
1116. 182
Costanza-Chock, S. (2020). Design Justice:
Community-Led Practices to Build the Worlds
We Need. Cambridge, MA: The MIT Press.
433
Cordonnier, J.-B., Loukas, A., & Jaggi, M. (2020).
On the relationship between self-attention and
convolutional layers. International Conference
on Learning Representations. 236
Cordts, M., Omran, M., Ramos, S., Rehfeld, T.,
Enzweiler, M., Benenson, R., Franke, U., Roth,
S., & Schiele, B. (2016). The Cityscapes
dataset for semantic urban scene understand-
ing. IEEE/CVF Computer Vision & Pattern
Recognition, 1877–1901. 6, 153
Coulombe, C. (2018). Text data augmentation
made simple by leveraging NLP cloud APIs.
arXiv:1812.04718. 160
Creel, K. A. (2020). Transparency in complex
computational systems. Philosophy of Science,
87(4), 568–589. 425, 435
Crenshaw, K. (1991). Mapping the margins: In-
tersectionality, identity politics, and violence
against women of color. Stanford Law Review,
43(6), 1241–1299. 423
Creswell, A., & Bharath, A. A. (2018). Inverting
the generator of a generative adversarial net-
work. IEEE Transactions on Neural Networks
and Learning Systems, 30(7), 1967–1974. 301
Creswell, A., White, T., Dumoulin, V., Arulku-
maran, K., Sengupta, B., & Bharath, A. A.
(2018). Generative adversarial networks: An
overview. IEEE Signal Processing Magazine,
35(1), 53–65. 298
Cristianini, N., & Shawe-Taylor, J. (2000). An In-
troduction to Support Vector Machines. CUP.
74
Croitoru, F.-A., Hondru, V., Ionescu, R. T., &
Shah, M. (2022). Diusion models in vision:
A survey. arXiv:2209.04747 . 369
Cubuk, E. D., Zoph, B., Mané, D., Vasude-
van, V., & Le, Q. V. (2019). Autoaug-
ment: Learning augmentation strategies from
data. IEEE/CVF Computer Vision & Pattern
Recognition, 113–123. 405
Cybenko, G. (1989). Approximation by superposi-
tions of a sigmoidal function. Mathematics of
Control, Signals and Systems, 2(4), 303–314.
38
Dabney, W., Rowland, M., Bellemare, M., &
Munos, R. (2018). Distributional reinforce-
ment learning with quantile regression. AAAI
Conference on Articial Intelligence. 397
Dai, H., Dai, B., & Song, L. (2016). Discrimina-
tive embeddings of latent variable models for
structured data. International Conference on
Machine Learning, 2702–2711. 262
Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G.,
Hu, H., & Wei, Y. (2017). Deformable con-
volutional networks. IEEE/CVF International
Conference on Computer Vision, 764–773. 183
Daigavane, A., Balaraman, R., & Aggarwal,
G. (2021). Understanding convolutions on
graphs. Distill, https://distill.pub/2021/understanding-gnns/. 261
Danaher, J. (2019). Automation and Utopia: Hu-
man Flourishing in a World without Work.
Harvard University Press. 430
Daniluk, M., Rocktäschel, T., Welbl, J., & Riedel,
S. (2017). Frustratingly short attention spans
in neural language modeling. International
Conference on Learning Representations. 235
Danks, D., & London, A. J. (2017). Algorith-
mic bias in autonomous systems. Interna-
tional Joint Conference on Articial Intelli-
gence, 4691–4697. 422
Dao, D. (2021). Awful AI. Github. Retrieved
January 17, 2023. https://github.com/daviddao/awful-ai. 14
Dar, Y., Muthukumar, V., & Baraniuk, R. G.
(2021). A farewell to the bias-variance trade-
off? An overview of the theory of overparam-
eterized machine learning. arXiv:2109.02355.
135
Das, H. P., Abbeel, P., & Spanos, C. J.
(2019). Likelihood contribution based
multi-scale architecture for generative ows.
arXiv:1908.01686. 323
Dauphin, Y. N., Pascanu, R., Gülçehre, Ç., Cho,
K., Ganguli, S., & Bengio, Y. (2014). Iden-
tifying and attacking the saddle point prob-
lem in high-dimensional non-convex optimiza-
tion. Neural Information Processing Systems,
27, 2933–2941. 409, 410
David, H. (2015). Why are there still so many jobs?
The history and future of workplace automa-
tion. Journal of Economic Perspectives, 29(3),
3–30. 14
De, S., & Smith, S. (2020). Batch normalization bi-
ases residual blocks towards the identity func-
tion in deep networks. Neural Information
Processing Systems, 33, 19964–19975. 205
De Cao, N., & Kipf, T. (2018). MolGAN: An
implicit generative model for small molecular
graphs. ICML Workshop on Theoretical Foun-
dations and Applications of Deep Generative
Models. 299
Dechter, R. (1986). Learning while searching in
constraint-satisfaction-problems. AAAI Con-
ference on Articial Intelligence, 178––183. 52
Deerrard, M., Bresson, X., & Vandergheynst,
P. (2016). Convolutional neural networks on
graphs with fast localized spectral ltering.
Neural Information Processing Systems, 29,
3837–3845. 262
Dehghani, M., Tay, Y., Gritsenko, A. A., Zhao,
Z., Houlsby, N., Diaz, F., Metzler, D., &
Vinyals, O. (2021). The benchmark lottery.
arXiv:2107.07002. 234
Deisenroth, M. P., Faisal, A. A., & Ong, C. S.
(2020). Mathematics for machine learning.
Cambridge University Press. 15
Dempster, A. P., Laird, N. M., & Rubin, D. B.
(1977). Maximum likelihood from incomplete
data via the EM algorithm. Journal of the
Royal Statistical Society: Series B, 39(1), 1–
22. 346
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K.,
& Fei-Fei, L. (2009). ImageNet: A large-scale
hierarchical image database. IEEE Computer
Vision & Pattern Recognition, 248–255. 181,
272
Denton, E. L., Chintala, S., Fergus, R., et al.
(2015). Deep generative image models using
a Laplacian pyramid of adversarial networks.
Neural Information Processing Systems, 28,
1486–1494. 300, 301
Devlin, J., Chang, M., Lee, K., & Toutanova, K.
(2019). BERT: pre-training of deep bidirec-
tional transformers for language understand-
ing. ACL Human Language Technologies,
4171–4186. 159, 234
DeVries, T., & Taylor, G. W. (2017a). Dataset aug-
mentation in feature space. arXiv:1702.05538.
158
DeVries, T., & Taylor, G. W. (2017b). Improved
regularization of convolutional neural networks
with Cutout. arXiv:1708.04552. 183
Dhariwal, P., & Nichol, A. (2021). Diusion mod-
els beat GANs on image synthesis. Neural In-
formation Processing Systems, 34, 8780–8794.
367, 368, 370
Ding, M., Xiao, B., Codella, N., Luo, P., Wang,
J., & Yuan, L. (2022). DaViT: Dual attention
vision transformers. European Conference on
Computer Vision, 74–92. 238
Dinh, L., Krueger, D., & Bengio, Y. (2015). NICE:
Non-linear independent components estima-
tion. International Conference on Learning
Representations Workshop. 323
Dinh, L., Pascanu, R., Bengio, S., & Bengio, Y.
(2017). Sharp minima can generalize for deep
nets. International Conference on Machine
Learning, 1019–1028. 411
Dinh, L., Sohl-Dickstein, J., & Bengio, S. (2016).
Density estimation using Real NVP. Inter-
national Conference on Learning Representa-
tions. 322, 323
Dinh, L., Sohl-Dickstein, J., Larochelle, H., & Pas-
canu, R. (2019). A RAD approach to deep mix-
ture models. ICLR Workshop on Deep Gener-
ative Models for Highly Structured Data. 323
Dockhorn, T., Vahdat, A., & Kreis, K.
(2022). Score-based generative modeling with
critically-damped Langevin diusion. Inter-
national Conference on Learning Representa-
tions. 370
Doersch, C., Gupta, A., & Efros, A. A. (2015).
Unsupervised visual representation learning by
context prediction. IEEE International Con-
ference on Computer Vision, 1422–1430. 159
Domingos, P. (2000). A unied bias-variance de-
composition. International Conference on Ma-
chine Learning, 231–238. 133
Domke, J. (2010). Statistical machine learning.
https://people.cs.umass.edu/~domke/. 116
Donahue, C., Lipton, Z. C., Balsubramani, A., &
McAuley, J. (2018a). Semantically decompos-
ing the latent spaces of generative adversarial
networks. International Conference on Learn-
ing Representations. 301
Donahue, C., McAuley, J., & Puckette, M.
(2018b). Adversarial audio synthesis. Inter-
national Conference on Learning Representa-
tions. 299, 301
Dong, X., Bao, J., Chen, D., Zhang, W.,
Yu, N., Yuan, L., Chen, D., & Guo, B.
(2022). CSWin transformer: A general vision
transformer backbone with cross-shaped win-
dows. IEEE/CVF Computer Vision & Pattern
Recognition, 12124–12134. 238
Dorta, G., Vicente, S., Agapito, L., Campbell,
N. D., & Simpson, I. (2018). Structured uncer-
tainty prediction networks. IEEE/CVF Com-
puter Vision & Pattern Recognition, 5477–
5485. 73, 340, 344
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weis-
senborn, D., Zhai, X., Unterthiner, T., De-
hghani, M., Minderer, M., Heigold, G., Gelly,
S., et al. (2021). An image is worth 16x16
words: Transformers for image recognition at
scale. International Conference on Learning
Representations. 234, 238
Dozat, T. (2016). Incorporating Nesterov momen-
tum into Adam. International Conference on
Learning Representations Workshop track.
94
Draxler, F., Veschgini, K., Salmhofer, M., & Ham-
precht, F. A. (2018). Essentially no barriers
in neural network energy landscape. Interna-
tional Conference on Machine Learning, 1308–
1317. 408, 409
Du, N., Huang, Y., Dai, A. M., Tong, S., Lep-
ikhin, D., Xu, Y., Krikun, M., Zhou, Y., Yu,
A. W., Firat, O., et al. (2022). GLaM: E-
cient scaling of language models with mixture-
of-experts. International Conference on Ma-
chine Learning, 5547–5569. 234
Du, S. S., Lee, J. D., Li, H., Wang, L., & Zhai, X.
(2019a). Gradient descent nds global minima
of deep neural networks. International Con-
ference on Machine Learning, 1675–1685. 404,
405
Du, S. S., Zhai, X., Poczos, B., & Singh, A.
(2019b). Gradient descent provably optimizes
over-parameterized neural networks. Inter-
national Conference on Learning Representa-
tions. 404
Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive
subgradient methods for online learning and
stochastic optimization. Journal of Machine
Learning Research, 12, 2121–2159. 93
Dufter, P., Schmitt, M., & Schütze, H. (2021).
Position information in transformers: An
overview. Computational Linguistics, 1–31.
236
Dumoulin, V., Belghazi, I., Poole, B., Mastropi-
etro, O., Lamb, A., Arjovsky, M., & Courville,
A. (2017). Adversarially learned inference. In-
ternational Conference on Learning Represen-
tations. 301, 345
Dumoulin, V., & Visin, F. (2016). A guide
to convolution arithmetic for deep learning.
arXiv:1603.07285. 180
Dupont, E., Doucet, A., & Teh, Y. W. (2019). Aug-
mented neural ODEs. Neural Information Pro-
cessing Systems, 32, 3134–3144. 324
Durkan, C., Bekasov, A., Murray, I., & Pa-
pamakarios, G. (2019a). Cubic-spline ows.
ICML Invertible Neural Networks and Nor-
malizing Flows. 323
Durkan, C., Bekasov, A., Murray, I., & Papa-
makarios, G. (2019b). Neural spline ows.
Neural Information Processing Systems, 32,
7509–7520. 323
Duvenaud, D. K., Maclaurin, D., Iparraguirre, J.,
Bombarell, R., Hirzel, T., Aspuru-Guzik, A., &
Adams, R. P. (2015). Convolutional networks
on graphs for learning molecular ngerprints.
Neural Information Processing Systems, 28,
2224–2232. 262
D’Amour, A., Heller, K., Moldovan, D., Adlam, B.,
Alipanahi, B., Beutel, A., Chen, C., Deaton,
J., Eisenstein, J., Homan, M. D., et al.
(2020). Underspecication presents challenges
for credibility in modern machine learning.
Journal of Machine Learning Research, 1–61.
413
Ebrahimi, J., Rao, A., Lowd, D., & Dou, D. (2018).
HotFlip: White-box adversarial examples for
text classication. Meeting of the Association
for Computational Linguistics, 31–36. 160
El Asri, L., & Prince, S. J. D. (2020). Tu-
torial #6: Neural natural language genera-
tion decoding algorithms. https://www.borealisai.com/research-blogs/tutorial-6-neural-natural-language-generation-decoding-algorithms/. 235
Eldan, R., & Shamir, O. (2016). The power of
depth for feedforward neural networks. PMLR
Conference on Learning Theory, 907–940. 53,
417
Elfwing, S., Uchibe, E., & Doya, K. (2018).
Sigmoid-weighted linear units for neural net-
work function approximation in reinforcement
learning. Neural Networks, 107 , 3–11. 38
Erasmus, A., Brunet, T. D. P., & Fisher, E. (2021).
What is interpretability? Philosophy & Tech-
nology, 34, 833–862. 425
Eren, L., Ince, T., & Kiranyaz, S. (2019). A generic
intelligent bearing fault diagnosis system using
compact adaptive 1D CNN classier. Journal
of Signal Processing Systems, 91(2), 179–189.
182
Erhan, D., Bengio, Y., Courville, A., & Vincent,
P. (2009). Visualizing higher-layer features of
a deep network. Technical Report, University
of Montreal, 1341(3). 184
Errica, F., Podda, M., Bacciu, D., & Micheli, A.
(2019). A fair comparison of graph neural net-
works for graph classication. International
Conference on Learning Representations. 262
Eslami, S., Heess, N., Weber, T., Tassa, Y., Szepes-
vari, D., Hinton, G. E., et al. (2016). Attend,
infer, repeat: Fast scene understanding with
generative models. Neural Information Pro-
cessing Systems, 29, 3225–3233. 344
Eslami, S. A., Jimenez Rezende, D., Besse, F., Vi-
ola, F., Morcos, A. S., Garnelo, M., Ruder-
man, A., Rusu, A. A., Danihelka, I., Gregor,
K., et al. (2018). Neural scene representation
and rendering. Science, 360(6394), 1204–1210.
344
Esling, P., Masuda, N., Bardet, A., Despres, R.,
et al. (2019). Universal audio synthesizer con-
trol with normalizing ows. International Con-
ference on Digital Audio Eects. 322
Esser, P., Rombach, R., & Ommer, B. (2021).
Taming transformers for high-resolution im-
age synthesis. IEEE/CVF Computer Vision
& Pattern Recognition, 12873–12883. 301
Esteves, C., Allen-Blanchette, C., Zhou, X., &
Daniilidis, K. (2018). Polar transformer net-
works. International Conference on Learning
Representations. 183
Etmann, C., Ke, R., & Schönlieb, C.-B. (2020).
iUNets: Fully invertible U-Nets with learn-
able up- and downsampling. IEEE Interna-
tional Workshop on Machine Learning for Sig-
nal Processing. 322
Eubanks, V. (2018). Automating Inequality: How
High-Tech Tools Prole, Police, and Punish
the Poor. New York: St. Martin’s Press. 433
Evans, K., de Moura, N., Chauvier, S., Chatila, R.,
& Dogan, E. (2020). Ethical decision making
in autonomous vehicles: the AV ethics project.
Science and Engineering Ethics, 26(6), 3285–
3312. 424
FAIR (2022). Human-level play in the game of
Diplomacy by combining language models with
strategic reasoning. Science, 378(6624), 1067–
1074. 396
Falbo, A., & LaCroix, T. (2022). Est-ce que
vous compute? Code-switching, cultural iden-
tity, and AI. Feminist Philosophy Quarterly,
8(3/4). 423
Falk, T., Mai, D., Bensch, R., Çiçek, Ö., Abdulka-
dir, A., Marrakchi, Y., Böhm, A., Deubner, J.,
Jäckel, Z., Seiwald, K., et al. (2019). U-Net:
Deep learning for cell counting, detection, and
morphometry. Nature Methods, 16(1), 67–70.
199
Falkner, S., Klein, A., & Hutter, F. (2018). BOHB:
Robust and ecient hyperparameter optimiza-
tion at scale. International Conference on Ma-
chine Learning, 1437–1446. 136
Fallah, N., Gu, H., Mohammad, K., Seyyedsalehi,
S. A., Nourijelyani, K., & Eshraghian, M. R.
(2009). Nonlinear Poisson regression using
neural networks: A simulation study. Neural
Computing and Applications, 18(8), 939–943.
74
Fan, A., Lewis, M., & Dauphin, Y. N. (2018). Hi-
erarchical neural story generation. Meeting of
the Association for Computational Linguistics,
889–898. 235
Fan, H., Xiong, B., Mangalam, K., Li, Y., Yan, Z.,
Malik, J., & Feichtenhofer, C. (2021). Multi-
scale vision transformers. IEEE/CVF Interna-
tional Conference on Computer Vision, 6824–
6835. 238
Fan, K., Li, B., Wang, J., Zhang, S., Chen, B.,
Ge, N., & Yan, Z. (2020). Neural zero-inated
quality estimation model for automatic speech
recognition system. INTERSPEECH, 606–610. 73
Fang, F., Yamagishi, J., Echizen, I., & Lorenzo-
Trueba, J. (2018). High-quality nonparallel
voice conversion based on cycle-consistent ad-
versarial network. International Conference
on Acoustics, Speech and Signal Processing,
5279–5283. 299
Fang, Y., Liao, B., Wang, X., Fang, J., Qi, J., Wu,
R., Niu, J., & Liu, W. (2021). You only look at
one sequence: Rethinking transformer in vision
through object detection. Neural Information
Processing Systems, 34, 26183–26197. 238
Farnia, F., & Ozdaglar, A. (2020). Do GANs al-
ways have Nash equilibria? International Con-
ference on Machine Learning, 3029–3039. 299
Fawzi, A., Balog, M., Huang, A., Hubert, T.,
Romera-Paredes, B., Barekatain, M., Novikov,
A., R Ruiz, F. J., Schrittwieser, J., Swirszcz,
G., et al. (2022). Discovering faster matrix
multiplication algorithms with reinforcement
learning. Nature, 610(7930), 47–53. 396
Fazelpour, S., & Danks, D. (2021). Algorithmic
bias: Senses, sources, solutions. Philosophy
Compass, 16. 421, 422, 435
Fedus, W., Goodfellow, I., & Dai, A. M. (2018).
MaskGAN: Better text generation via lling in
the_. International Conference on Learning
Representations. 299
Feng, S. Y., Gangal, V., Kang, D., Mitamura, T.,
& Hovy, E. (2020). GenAug: Data augmenta-
tion for netuning text generators. ACL Deep
Learning Inside Out, 29–42. 160
Feng, Z., Zhang, Z., Yu, X., Fang, Y., Li,
L., Chen, X., Lu, Y., Liu, J., Yin,
W., Feng, S., et al. (2022). ERNIE-
ViLG 2.0: Improving text-to-image diusion
model with knowledge-enhanced mixture-of-
denoising-experts. arXiv:2210.15257. 371
Fernandez, C. (2017). Can a computer tell if you’re
gay? Articial intelligence system guesses your
sexuality with 91% accuracy just by looking at
a photo of your face. Daily Mail, 7 Sept, 2017.
https://www.dailymail.co.uk/sciencetech/article-4862676/Artificial-intelligence-tell-gay.html. 430
Fernández-Madrigal, J.-A., & González, J. (2002).
Multihierarchical graph search. IEEE Trans-
actions on Pattern Analysis and Machine In-
telligence, 24(1), 103–113. 242
Fetscherin, M., Tantle-Dunn, S., & Klumb, A.
(2020). Eects of facial features and styling el-
ements on perceptions of competence, warmth,
and hireability of male professionals. The Jour-
nal of Social Psychology, 160(3), 332–345. 427
Finlay, C., Jacobsen, J., Nurbekyan, L., & Ober-
man, A. M. (2020). How to train your neural
ODE: The world of Jacobian and kinetic reg-
ularization. International Conference on Ma-
chine Learning, 3154–3164. 324
Fort, S., Hu, H., & Lakshminarayanan, B. (2019).
Deep ensembles: A loss landscape perspective.
arXiv:1912.02757. 158
Fort, S., & Jastrzębski, S. (2019). Large scale struc-
ture of neural network loss landscapes. Neu-
ral Information Processing Systems, 32,
6706–6714. 408
Fort, S., & Scherlis, A. (2019). The Goldilocks
zone: Towards better understanding of neu-
ral network loss landscapes. AAAI Conference
on Articial Intelligence, 3574–3581. 409, 410,
412, 413
Fortunato, M., Azar, M. G., Piot, B., Menick, J.,
Osband, I., Graves, A., Mnih, V., Munos, R.,
Hassabis, D., Pietquin, O., et al. (2018). Noisy
networks for exploration. International Con-
ference on Learning Representations. 397
François-Lavet, V., Henderson, P., Islam, R.,
Bellemare, M. G., Pineau, J., et al. (2018). An
introduction to deep reinforcement learning.
Foundations and Trends in Machine Learning,
11(3-4), 219–354. 396
Frankle, J., & Carbin, M. (2019). The lottery ticket
hypothesis: Finding sparse, trainable neural
networks. International Conference on Learn-
ing Representations. 406, 415
Frankle, J., Dziugaite, G. K., Roy, D. M., &
Carbin, M. (2020). Linear mode connectivity
and the lottery ticket hypothesis. International
Conference on Machine Learning, 3259–3269.
158, 408
Frankle, J., Schwab, D. J., & Morcos, A. S. (2021).
Training BatchNorm and only BatchNorm: On
the expressive power of random features in
CNNs. International Conference on Learning
Representations. 418
Freund, Y., & Schapire, R. E. (1997). A decision-
theoretic generalization of on-line learning and
an application to boosting. Journal of Com-
puter and System Sciences, 55(1), 119–139. 74
Frey, C. B. (2019). The Technology Trap: Capital,
Labour, and Power in the Age of Automation.
Princeton University Press. 430
Frey, C. B., & Osborne, M. A. (2017). The fu-
ture of employment: How susceptible are jobs
to computerisation? Technological forecasting
and social change, 114, 254–280. 430
Friedman, J. H. (1997). On bias, variance, 0/1—
loss, and the curse-of-dimensionality. Data
Mining and Knowledge Discovery, 1(1), 55–77.
133
Fujimoto, S., Hoof, H., & Meger, D. (2018). Ad-
dressing function approximation error in actor-
critic methods. International Conference on
Machine Learning, 1587–1596. 397
Fujimoto, S., Meger, D., & Precup, D. (2019). O-
policy deep reinforcement learning without ex-
ploration. International Conference on Ma-
chine Learning, 2052–2062. 398
Fukushima, K. (1969). Visual feature extraction
by a multilayered network of analog threshold
elements. IEEE Transactions on Systems Sci-
ence and Cybernetics, 5(4), 322–333. 37
Fukushima, K., & Miyake, S. (1982). Neocogni-
tron: A self-organizing neural network model
for a mechanism of visual pattern recognition.
Competition and Cooperation in Neural Nets,
267–285. 180
Gabriel, I. (2020). Articial intelligence, values,
and alignment. Minds and Machines, 30, 411–
437. 421
Gal, Y., & Ghahramani, Z. (2016). Dropout as a
Bayesian approximation: Representing model
uncertainty in deep learning. International
Conference on Machine Learning, 1050–1059.
158
Gales, M. J. (1998). Maximum likelihood linear
transformations for HMM-based speech recog-
nition. Computer Speech & Language, 12(2),
75–98. 160
Gales, M. J., Ragni, A., AlDamarki, H., & Gau-
tier, C. (2009). Support vector machines for
noise robust ASR. 2009 IEEE Workshop on
Automatic Speech Recognition & Understand-
ing, 205–210. 160
Ganaie, M., Hu, M., Malik, A., Tanveer, M., &
Suganthan, P. (2022). Ensemble deep learning:
A review. Engineering Applications of Arti-
cial Intelligence, 115. 158
Gao, H., & Ji, S. (2019). Graph U-Nets. Interna-
tional Conference on Machine Learning, 2083–
2092. 265
Gao, R., Song, Y., Poole, B., Wu, Y. N., &
Kingma, D. P. (2021). Learning energy-based
models by diusion recovery likelihood. Inter-
national Conference on Learning Representa-
tions. 370
Garg, R., Bg, V. K., Carneiro, G., & Reid, I.
(2016). Unsupervised CNN for single view
depth estimation: Geometry to the rescue. Eu-
ropean Conference on Computer Vision, 740–
756. 205
Garipov, T., Izmailov, P., Podoprikhin, D., Vetrov,
D., & Wilson, A. G. (2018). Loss surfaces,
mode connectivity, and fast ensembling of
DNNs. Neural Information Processing Sys-
tems, 31, 8803–8812. 158, 408
Gastaldi, X. (2017a). Shake-shake regularization.
arXiv:1705.07485. 203
Gastaldi, X. (2017b). Shake-shake regularization
of 3-branch residual networks. 203
Gebru, T., Bender, E. M., McMillan-Major, A.,
& Mitchell, M. (2023). Statement from the
listed authors of stochastic parrots on the “AI
pause” letter. https://www.dair-institute.org/blog/letter-statement-March2023. 435
Gemici, M. C., Rezende, D., & Mohamed, S.
(2016). Normalizing ows on Riemannian man-
ifolds. NIPS Workshop on Bayesian Deep
Learning. 324
Germain, M., Gregor, K., Murray, I., & Larochelle,
H. (2015). MADE: Masked autoencoder for
distribution estimation. International Confer-
ence on Machine Learning, 881–889. 323
Ghosh, A., Kulharia, V., Namboodiri, V. P.,
Torr, P. H., & Dokania, P. K. (2018).
Multi-agent diverse generative adversarial net-
works. IEEE/CVF Computer Vision & Pat-
tern Recognition, 8513–8521. 300
Gidaris, S., Singh, P., & Komodakis, N. (2018).
Unsupervised representation learning by pre-
dicting image rotations. International Confer-
ence on Learning Representations. 159
Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals,
O., & Dahl, G. E. (2017). Neural message pass-
ing for quantum chemistry. International Con-
ference on Machine Learning, 1263–1272. 262
Girdhar, R., Carreira, J., Doersch, C., & Zisser-
man, A. (2019). Video action transformer net-
work. IEEE/CVF Computer Vision & Pattern
Recognition, 244–253. 238
Girshick, R. (2015). Fast R-CNN. IEEE Interna-
tional Conference on Computer Vision, 1440–
1448. 183
Girshick, R., Donahue, J., Darrell, T., & Malik,
J. (2014). Rich feature hierarchies for accu-
rate object detection and semantic segmenta-
tion. IEEE Computer Vision & Pattern Recog-
nition, 580–587. 183
Glorot, X., & Bengio, Y. (2010). Understanding
the diculty of training deep feedforward neu-
ral networks. International Conference on Ar-
ticial Intelligence and Statistics, 9, 249–256.
113, 183
Glorot, X., Bordes, A., & Bengio, Y. (2011).
Deep sparse rectier neural networks. Inter-
national Conference on Articial Intelligence
and Statistics, 315–323. 37, 38
Goh, G. (2017). Why momentum really works. Dis-
till, http://distill.pub/2017/momentum. 92
Goldberg, D. E. (1987). Simple genetic algorithms
and the minimal deceptive problem. Genetic
Algorithms and Simulated Annealing, 74–88.
Morgan Kaufmann. 421
Gomez, A. N., Ren, M., Urtasun, R., & Grosse,
R. B. (2017). The reversible residual network:
Backpropagation without storing activations.
Neural Information Processing Systems, 30,
2214–2224. 114, 322, 323
Gómez-Bombarelli, R., Wei, J. N., Duvenaud, D.,
Hernández-Lobato, J. M., Sánchez-Lengeling,
B., Sheberla, D., Aguilera-Iparraguirre, J.,
Hirzel, T. D., Adams, R. P., & Aspuru-Guzik,
A. (2018). Automatic chemical design us-
ing a data-driven continuous representation of
molecules. ACS Central Science, 4(2), 268–
276. 343, 344
Gong, S., Bahri, M., Bronstein, M. M., & Zafeiriou,
S. (2020). Geometrically principled connec-
tions in graph neural networks. IEEE/CVF
Computer Vision & Pattern Recognition,
11415–11424. 266
Goodfellow, I. (2016). Generative adversarial net-
works. NIPS 2016 Tutorial. 298
Goodfellow, I., Bengio, Y., & Courville, A. (2016).
Deep learning. MIT Press. 15, 157
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu,
B., Warde-Farley, D., Ozair, S., Courville, A.,
& Bengio, Y. (2014). Generative adversar-
ial networks. Communications of the ACM,
63(11), 139–144. 273, 298, 300
Goodfellow, I. J., Shlens, J., & Szegedy, C.
(2015a). Explaining and harnessing adversarial
examples. International Conference on Learn-
ing Representations. 159, 413
Goodfellow, I. J., Vinyals, O., & Saxe, A. M.
(2015b). Qualitatively characterizing neural
network optimization problems. International
Conference on Learning Representations. 407,
408
Goodin, D. (2023). ChatGPT is enabling script
kiddies to write functional malware. Ars Tech-
nica, June 1, 2023. https://arstechnica.com/
information-technology/2023/01/chatgpt-
is-enabling-script-kiddies-to-write-
functional-malware/. 428
Gordon, G. J. (1995). Stable fitted reinforcement
learning. Neural Information Processing Sys-
tems, 8, 1052–1058. 396
Gori, M., Monfardini, G., & Scarselli, F. (2005).
A new model for learning in graph domains.
IEEE International Joint Conference on Neu-
ral Networks, 2005, 729–734. 262
Gouk, H., Frank, E., Pfahringer, B., & Cree, M. J.
(2021). Regularisation of neural networks by
enforcing Lipschitz continuity. Machine Learn-
ing, 110(2), 393–416. 156
Goyal, A., Bochkovskiy, A., Deng, J., & Koltun, V.
(2021). Non-deep networks. arXiv:2110.07641.
417
Goyal, P., Dollár, P., Girshick, R., Noordhuis,
P., Wesolowski, L., Kyrola, A., Tulloch, A.,
Jia, Y., & He, K. (2018). Accurate, large
minibatch SGD: Training ImageNet in 1 hour.
arXiv:1706.02677. 92, 93, 237, 410
Graesser, L., & Keng, W. L. (2019). Foundations of
deep reinforcement learning. Addison-Wesley
Professional. 16, 396
Grathwohl, W., Chen, R. T., Bettencourt, J.,
Sutskever, I., & Duvenaud, D. (2019). FFJORD:
Free-form continuous dynamics for scalable re-
versible generative models. International Con-
ference on Learning Representations. 324
Grattarola, D., Zambon, D., Bianchi, F. M., &
Alippi, C. (2022). Understanding pooling in
graph neural networks. IEEE Transactions on
Neural Networks and Learning Systems. 265
Green, B. (2019). “Good” isn’t good enough.
NeurIPS Workshop on AI for Social Good. 433
Green, B. (2022). Escaping the impossibility of
fairness: From formal to substantive algorith-
mic fairness. Philosophy & Technology, 35(90).
422
Greensmith, E., Bartlett, P. L., & Baxter, J.
(2004). Variance reduction techniques for
gradient estimates in reinforcement learning.
Journal of Machine Learning Research, 5(9),
1471–1530. 397
Gregor, K., Besse, F., Jimenez Rezende, D., Dani-
helka, I., & Wierstra, D. (2016). Towards con-
ceptual compression. Neural Information Pro-
cessing Systems, 29, 3549–3557. 343, 344
Gregor, K., Papamakarios, G., Besse, F., Buesing,
L., & Weber, T. (2019). Temporal difference
variational auto-encoder. International Con-
ference on Learning Representations. 344
Grennan, L., Kremer, A., Singla, A., & Zipparo,
P. (2022). Why businesses need explainable
AI—and how to deliver it. McKinsey, Septem-
ber 29, 2022. https://www.mckinsey.com/
capabilities/quantumblack/our-insights/
why-businesses-need-explainable-ai-and-
how-to-deliver-it/. 13
Greydanus, S. (2020). Scaling down deep learning.
arXiv:2011.14439. 119
Griewank, A., & Walther, A. (2008). Evaluating
derivatives: Principles and techniques of algo-
rithmic dierentiation. SIAM. 113
Gu, J., Kwon, H., Wang, D., Ye, W., Li, M.,
Chen, Y.-H., Lai, L., Chandra, V., & Pan,
D. Z. (2022). Multi-scale high-resolution
vision transformer for semantic segmenta-
tion. IEEE/CVF Computer Vision & Pattern
Recognition, 12094–12103. 238
This work is subject to a Creative Commons CC-BY-NC-ND license. (C) MIT Press.
Bibliography 477
Guan, S., Tai, Y., Ni, B., Zhu, F., Huang,
F., & Yang, X. (2020). Collaborative
learning for faster StyleGAN embedding.
arXiv:2007.01758. 301
Gui, J., Sun, Z., Wen, Y., Tao, D., & Ye, J. (2021).
A review on generative adversarial networks:
Algorithms, theory, and applications. IEEE
Transactions on Knowledge and Data Engi-
neering. 299
Guimaraes, G. L., Sanchez-Lengeling, B., Out-
eiral, C., Farias, P. L. C., & Aspuru-Guzik, A.
(2017). Objective-reinforced generative adver-
sarial networks (ORGAN) for sequence gener-
ation models. arXiv:1705.10843. 299
Gulrajani, I., Kumar, K., Ahmed, F., Taiga, A. A.,
Visin, F., Vazquez, D., & Courville, A. (2016).
PixelVAE: A latent variable model for natural
images. International Conference on Learning
Representations. 299, 343, 344, 345
Ha, D., Dai, A., & Le, Q. V. (2017). Hypernet-
works. International Conference on Learning
Representations. 235
Haarnoja, T., Hartikainen, K., Abbeel, P., &
Levine, S. (2018a). Latent space policies for
hierarchical reinforcement learning. Interna-
tional Conference on Machine Learning, 1851–
1860. 322
Haarnoja, T., Zhou, A., Abbeel, P., & Levine,
S. (2018b). Soft actor-critic: Off-policy maxi-
mum entropy deep reinforcement learning with
a stochastic actor. International Conference
on Machine Learning, 1861–1870. 398
Hagendor, T. (2020). The ethics of AI ethics: An
evaluation of guidelines. Minds and Machines,
30(1), 99–120. 420
Hamilton, W., Ying, Z., & Leskovec, J. (2017a).
Inductive representation learning on large
graphs. Neural Information Processing Sys-
tems, 30, 1024–1034. 262, 263, 264, 265, 267
Hamilton, W. L. (2020). Graph representation
learning. Synthesis Lectures on Artificial Intel-
ligence and Machine Learning, 14(3), 1–159.
15, 261
Hamilton, W. L., Ying, R., & Leskovec, J. (2017b).
Representation learning on graphs: Methods
and applications. IEEE Data Engineering Bul-
letin, 40(3), 52–74. 263
Han, S., Mao, H., & Dally, W. J. (2016). Deep
compression: Compressing deep neural net-
works with pruning, trained quantization and
Human coding. International Conference on
Learning Representations. 414, 415
Han, S., Pool, J., Tran, J., & Dally, W. (2015).
Learning both weights and connections for effi-
cient neural network. Neural Information Pro-
cessing Systems, vol. 28, 1135–1143. 414
Hannun, A. Y., Case, C., Casper, J., Catan-
zaro, B., Diamos, G., Elsen, E., Prenger, R.,
Satheesh, S., Sengupta, S., Coates, A., & Ng,
A. Y. (2014). Deep speech: Scaling up end-to-
end speech recognition. arXiv:1412.5567. 160
Hanson, S. J., & Pratt, L. Y. (1988). Comparing
biases for minimal network construction with
back-propagation. Neural Information Pro-
cessing Systems, vol. 2, 177–185. 155
Harding, S. (1986). The Science Question in Fem-
inism. Cornell University Press. 433
Härkönen, E., Hertzmann, A., Lehtinen, J., &
Paris, S. (2020). GANSpace: Discovering in-
terpretable GAN controls. Neural Information
Processing Systems, 33, 9841–9850. 300
Hartmann, K. G., Schirrmeister, R. T., & Ball,
T. (2018). EEG-GAN: Generative adversar-
ial networks for electroencephalographic (EEG)
brain signals. arXiv:1806.01875. 299
Harvey, W., Naderiparizi, S., Masrani, V., Weil-
bach, C., & Wood, F. (2022). Flexible diffusion
modeling of long videos. Neural Information
Processing Systems, 35. 369
Hasanzadeh, A., Hajiramezanali, E., Boluki, S.,
Zhou, M., Dueld, N., Narayanan, K., & Qian,
X. (2020). Bayesian graph neural networks
with adaptive connection sampling. Interna-
tional Conference on Machine Learning, 4094–
4104. 265
Hassibi, B., & Stork, D. G. (1993). Second or-
der derivatives for network pruning: Optimal
brain surgeon. Neural Information Processing
Systems, vol. 6, 164–171. 414
Hausknecht, M., & Stone, P. (2015). Deep recur-
rent Q-learning for partially observable MDPs.
AAAI Fall Symposia, 29–37. 397
Hayou, S., Clerico, E., He, B., Deligiannidis, G.,
Doucet, A., & Rousseau, J. (2021). Stable
ResNet. International Conference on Artificial
Intelligence and Statistics, 1324–1332. 205
He, F., Liu, T., & Tao, D. (2019). Control batch
size and learning rate to generalize well: The-
oretical and empirical evidence. Neural Infor-
mation Processing Systems, 32, 1143–1152. 92,
410, 411
He, J., Neubig, G., & Berg-Kirkpatrick, T. (2018).
Unsupervised learning of syntactic structure
with invertible neural projections. ACL Em-
pirical Methods in Natural Language Process-
ing, 1292–1302. 322
He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delv-
ing deep into rectifiers: Surpassing human-
level performance on ImageNet classification.
IEEE International Conference on Computer
Vision, 1026–1034. 38, 113, 183
He, K., Zhang, X., Ren, S., & Sun, J. (2016a).
Deep residual learning for image recogni-
tion. IEEE/CVF Computer Vision & Pattern
Recognition, 770–778. 188, 201, 323, 405
He, K., Zhang, X., Ren, S., & Sun, J. (2016b).
Identity mappings in deep residual networks.
European Conference on Computer Vision,
630–645. 202, 405
He, P., Liu, X., Gao, J., & Chen, W. (2021). De-
BERTa: Decoding-enhanced BERT with dis-
entangled attention. International Conference
on Learning Representations. 236
He, X., Haari, G., & Norouzi, M. (2020).
Dynamic programming encoding for subword
segmentation in neural machine translation.
Meeting of the Association for Computational
Linguistics, 3042–3051. 234
He, Y., Zhang, X., & Sun, J. (2017). Channel
pruning for accelerating very deep neural net-
works. IEEE/CVF International Conference
on Computer Vision, 1389–1397. 414
Heess, N., Wayne, G., Silver, D., Lillicrap, T.,
Erez, T., & Tassa, Y. (2015). Learning contin-
uous control policies by stochastic value gradi-
ents. Neural Information Processing Systems,
28, 2944–2952. 344
Heikkilä, M. (2022). Why business is booming for
military AI startups. MIT Technology Review,
July 7, 2022. https://www.technologyreview.
com/2022/07/07/1055526/why-business-is-
booming-for-military-ai-startups/. 13,
427
Hena, M., Bruna, J., & LeCun, Y. (2015). Deep
convolutional networks on graph-structured
data. arXiv:1506.05163. 262
Henderson, P., Li, X., Jurafsky, D., Hashimoto, T.,
Lemley, M. A., & Liang, P. (2023). Foundation
models and fair use. arXiv:2303.15715. 428
Hendrycks, D., & Gimpel, K. (2016). Gaussian
error linear units (GELUs). arXiv:1606.08415.
38
Herrmann, V. (2017). Wasserstein GAN
and the Kantorovich-Rubinstein duality.
https://vincentherrmann.github.io/blog/
wasserstein/. 284, 299
Hernández, C. X., Wayment-Steele, H. K., Sul-
tan, M. M., Husic, B. E., & Pande, V. S.
(2018). Variational encoding of complex dy-
namics. Physical Review E, 97(6), 062412. 344
Hertz, A., Mokady, R., Tenenbaum, J., Aber-
man, K., Pritch, Y., & Cohen-Or, D. (2022).
Prompt-to-prompt image editing with cross at-
tention control. arXiv:2208.01626. 369
Hessel, M., Modayil, J., van Hasselt, H., Schaul,
T., Ostrovski, G., Dabney, W., Horgan, D.,
Piot, B., Azar, M., & Silver, D. (2018). Rain-
bow: Combining improvements in deep rein-
forcement learning. AAAI Conference on Ar-
tificial Intelligence, 3215–3222. 397
Heusel, M., Ramsauer, H., Unterthiner, T.,
Nessler, B., & Hochreiter, S. (2017). GANs
trained by a two time-scale update rule con-
verge to a local Nash equilibrium. Neural In-
formation Processing Systems, 30, 6626–6637.
274
Heyns, C. (2017). Autonomous weapons in armed
conict and the right to a dignied life: An
African perspective. South African Journal of
Human Rights, 33(1), 46–71. 429
Higgins, I., Matthey, L., Pal, A., Burgess, C.,
Glorot, X., Botvinick, M., Mohamed, S., &
Lerchner, A. (2017). Beta-VAE: Learning ba-
sic visual concepts with a constrained varia-
tional framework. International Conference on
Learning Representations. 345
Himmelreich, J. (2022). Against ‘democratizing
AI’. AI & Society. 435
Hindupur, A. (2022). The GAN zoo. GitHub.
Retrieved January 17, 2023. https://github.
com/hindupuravinash/the-gan-zoo. 299
Hinton, G., Srivastava, N., & Swersky, K. (2012a).
Neural networks for machine learning: Lec-
ture 6a Overview of mini-batch gradient
descent. https://www.cs.toronto.edu/
~tijmen/csc321/slides/lecture_slides_
lec6.pdf. 93
Hinton, G., & van Camp, D. (1993). Keeping neu-
ral networks simple by minimising the descrip-
tion length of weights. Computational Learning
Theory, 5–13. 159
Hinton, G., Vinyals, O., Dean, J., et al. (2015).
Distilling the knowledge in a neural network.
arXiv:1503.02531, 2(7). 415
Hinton, G. E., & Salakhutdinov, R. R. (2006). Re-
ducing the dimensionality of data with neural
networks. Science, 313(5786), 504–507. 344
Hinton, G. E., Srivastava, N., Krizhevsky,
A., Sutskever, I., & Salakhutdinov, R. R.
(2012b). Improving neural networks by pre-
venting co-adaptation of feature detectors.
arXiv:1207.0580. 158
Ho, J., Chen, X., Srinivas, A., Duan, Y., & Abbeel,
P. (2019). Flow++: Improving flow-based gen-
erative models with variational dequantization
and architecture design. International Confer-
ence on Machine Learning, 2722–2730. 322,
323
Ho, J., Jain, A., & Abbeel, P. (2020). Denois-
ing diffusion probabilistic models. Neural In-
formation Processing Systems, 33, 6840–6851.
274, 367, 369
Ho, J., Saharia, C., Chan, W., Fleet, D. J.,
Norouzi, M., & Salimans, T. (2022a). Cas-
caded diusion models for high delity image
generation. Journal of Machine Learning Re-
search, 23, 47–1. 369, 370
Ho, J., & Salimans, T. (2022). Classifier-free dif-
fusion guidance. NeurIPS Workshop on Deep
Generative Models and Downstream Applica-
tions. 370
Ho, J., Salimans, T., Gritsenko, A., Chan, W.,
Norouzi, M., & Fleet, D. J. (2022b). Video
diusion models. International Conference on
Learning Representations. 369
Hochreiter, S., & Schmidhuber, J. (1997a). Flat
minima. Neural Computation, 9(1), 1–42. 411
Hochreiter, S., & Schmidhuber, J. (1997b). Long
short-term memory. Neural Computation,
9(8), 1735–1780. 233
Hoer, E., Hubara, I., & Soudry, D. (2017). Train
longer, generalize better: Closing the general-
ization gap in large batch training of neural
networks. Neural Information Processing Sys-
tems, 30, 1731–1741. 203, 204
Homan, M. D., & Johnson, M. J. (2016). ELBO
surgery: Yet another way to carve up the vari-
ational evidence lower bound. NIPS Workshop
in Advances in Approximate Bayesian Infer-
ence, 2. 346
Homann, J., Borgeaud, S., Mensch, A.,
Buchatskaya, E., Cai, T., Rutherford, E.,
Casas, D. d. L., Hendricks, L. A., Welbl,
J., Clark, A., et al. (2023). Train-
ing compute-optimal large language models.
arXiv:2203.15556. 234
Hofstadter, D. R. (1995). The ineradicable Eliza ef-
fect and its dangers (preface 4). Fluid Concepts
and Creative Analogies: Computer Models Of
The Fundamental Mechanisms Of Thought,
155–168. Basic Books. 428
Holland, C. A., Ebner, N. C., Lin, T., & Samanez-
Larkin, G. R. (2019). Emotion identifica-
tion across adulthood using the dynamic faces
database of emotional expressions in younger,
middle aged, and older adults. Cognition and
Emotion, 33(2), 245–257. 9
Holtzman, A., Buys, J., Du, L., Forbes, M., &
Choi, Y. (2020). The curious case of neural
text degeneration. International Conference
on Learning Representations. 235
Hoogeboom, E., Nielsen, D., Jaini, P., Forré, P., &
Welling, M. (2021). Argmax flows and multi-
nomial diffusion: Learning categorical distri-
butions. Neural Information Processing Sys-
tems, 34, 12454–12465. 369
Hoogeboom, E., Peters, J., Van Den Berg, R., &
Welling, M. (2019a). Integer discrete flows and
lossless compression. Neural Information Pro-
cessing Systems, 32, 12134–12144. 324
Hoogeboom, E., Van Den Berg, R., & Welling, M.
(2019b). Emerging convolutions for generative
normalizing ows. International Conference
on Machine Learning, 2771–2780. 322
Höppe, T., Mehrjou, A., Bauer, S., Nielsen, D.,
& Dittadi, A. (2022). Diffusion models for
video prediction and infilling. ECCV Work-
shop on AI for Creative Video Editing and
Understanding. 369
Hornik, K. (1991). Approximation capabilities of
multilayer feedforward networks. Neural Net-
works, 4(2), 251–257. 38
Howard, A., Sandler, M., Chu, G., Chen, L.-C.,
Chen, B., Tan, M., Wang, W., Zhu, Y., Pang,
R., Vasudevan, V., et al. (2019). Searching for
MobileNetV3. IEEE/CVF International Con-
ference on Computer Vision, 1314–1324. 38
Howard, A. G., Zhu, M., Chen, B., Kalenichenko,
D., Wang, W., Weyand, T., Andreetto, M., &
Adam, H. (2017). MobileNets: Efficient con-
volutional neural networks for mobile vision
applications. arXiv:1704.04861. 181
Howard, R. A. (1960). Dynamic programming and
Markov processes. Wiley. 396
Hsu, C.-C., Hwang, H.-T., Wu, Y.-C., Tsao, Y.,
& Wang, H.-M. (2017a). Voice conversion
from unaligned corpora using variational au-
toencoding Wasserstein generative adversarial
networks. INTERSPEECH, 3364–3368. 345
Hsu, W.-N., Zhang, Y., & Glass, J. (2017b). Learn-
ing latent representations for speech genera-
tion and transformation. INTERSPEECH,
1273–1277. 343
Hu, H., Gu, J., Zhang, Z., Dai, J., & Wei, Y.
(2018a). Relation networks for object detec-
tion. IEEE/CVF Computer Vision & Pattern
Recognition, 3588–3597. 238
Hu, H., Zhang, Z., Xie, Z., & Lin, S. (2019).
Local relation networks for image recogni-
tion. IEEE/CVF International Conference on
Computer Vision, 3464–3473. 238
Hu, J., Shen, L., & Sun, G. (2018b). Squeeze-and-
excitation networks. IEEE/CVF Computer
Vision & Pattern Recognition, 7132–7141. 181,
235
Hu, W., Pang, J., Liu, X., Tian, D., Lin, C.-W., &
Vetro, A. (2022). Graph signal processing for
geometric data and beyond: Theory and ap-
plications. IEEE Transactions on Multimedia,
24, 3961–3977. 242
Hu, Z., Yang, Z., Liang, X., Salakhutdinov, R.,
& Xing, E. P. (2017). Toward controlled gen-
eration of text. International Conference on
Machine Learning, 1587–1596. 343
Huang, C.-W., Krueger, D., Lacoste, A., &
Courville, A. (2018a). Neural autoregressive
flows. International Conference on Machine
Learning, 2078–2087. 323, 324
Huang, G., Li, Y., Pleiss, G., Liu, Z., Hopcroft,
J. E., & Weinberger, K. Q. (2017a). Snap-
shot ensembles: Train 1, get M for free. Inter-
national Conference on Learning Representa-
tions. 158
Huang, G., Liu, Z., Van Der Maaten, L., & Wein-
berger, K. Q. (2017b). Densely connected con-
volutional networks. IEEE/CVF Computer
Vision & Pattern Recognition, 4700–4708. 205,
405
Huang, G., Sun, Y., Liu, Z., Sedra, D., & Wein-
berger, K. Q. (2016). Deep networks with
stochastic depth. European Conference on
Computer Vision, 646–661. 202
Huang, W., Zhang, T., Rong, Y., & Huang, J.
(2018b). Adaptive sampling towards fast graph
representation learning. Neural Information
Processing Systems, 31, 4563–4572. 264, 265
Huang, X., Li, Y., Poursaeed, O., Hopcroft, J., &
Belongie, S. (2017c). Stacked generative adver-
sarial networks. IEEE/CVF Computer Vision
& Pattern Recognition, 5077–5086. 300
Huang, X. S., Perez, F., Ba, J., & Volkovs, M.
(2020a). Improving transformer optimization
through better initialization. International
Conference on Machine Learning, 4475–4483.
114, 237
Huang, Y., Cheng, Y., Bapna, A., Firat, O., Chen,
D., Chen, M., Lee, H., Ngiam, J., Le, Q. V.,
Wu, Y., et al. (2019). GPipe: Efficient train-
ing of giant neural networks using pipeline par-
allelism. Neural Information Processing Sys-
tems, 32, 103–112. 114
Huang, Z., Liang, D., Xu, P., & Xiang, B. (2020b).
Improve transformer models with better rela-
tive position embeddings. Empirical Methods
in Natural Language Processing. 236
Huang, Z., & Wang, N. (2018). Data-driven sparse
structure selection for deep neural networks.
European Conference on Computer Vision,
304–320. 414
Hubinger, E., van Merwijk, C., Mikulik, V., Skalse,
J., & Garrabrant, S. (2019). Risks from learned
optimization in advanced machine learning
systems. arXiv:1906.01820. 421
Hussein, A., Gaber, M. M., Elyan, E., & Jayne, C.
(2017). Imitation learning: A survey of learn-
ing methods. ACM Computing Surveys, 50(2),
1–35. 398
Huszár, F. (2019). Exponentially growing learn-
ing rate? Implications of scale invariance
induced by batch normalization. https://
www.inference.vc/exponentially-growing-
learning-rate-implications-of-scale-
invariance-induced-by-BatchNorm/. 204
Hutchinson, M. F. (1989). A stochastic estimator
of the trace of the influence matrix for Lapla-
cian smoothing splines. Communications in
Statistics-Simulation and Computation, 18(3),
1059–1076. 324
Hutter, F., Hoos, H. H., & Leyton-Brown, K.
(2011). Sequential model-based optimization
for general algorithm configuration. Interna-
tional Conference on Learning and Intelligent
Optimization, 507–523. 136
Iglovikov, V., & Shvets, A. (2018). Ter-
nausNet: U-Net with VGG11 encoder pre-
trained on ImageNet for image segmentation.
arXiv:1801.05746. 205
Ilyas, A., Santurkar, S., Tsipras, D., Engstrom, L.,
Tran, B., & Madry, A. (2019). Adversarial ex-
amples are not bugs, they are features. Neural
Information Processing Systems, 32, 125–136.
414
Inoue, H. (2018). Data augmentation by
pairing samples for images classification.
arXiv:1801.02929. 159
Inoue, T., Choudhury, S., De Magistris, G., & Das-
gupta, S. (2018). Transfer learning from syn-
thetic to real images using variational autoen-
coders for precise position detection. IEEE In-
ternational Conference on Image Processing,
2725–2729. 344
Ioe, S. (2017). Batch renormalization: Towards
reducing minibatch dependence in batch-
normalized models. Neural Information Pro-
cessing Systems, 30, 1945–1953. 203
Ioe, S., & Szegedy, C. (2015). Batch normaliza-
tion: Accelerating deep network training by re-
ducing internal covariate shift. International
Conference on Machine Learning, 448–456.
114, 203, 204
Ishida, T., Yamane, I., Sakai, T., Niu, G., &
Sugiyama, M. (2020). Do we need zero train-
ing loss after achieving zero training error? In-
ternational Conference on Machine Learning,
4604–4614. 134, 159
Isola, P., Zhu, J.-Y., Zhou, T., & Efros, A. A.
(2017). Image-to-image translation with condi-
tional adversarial networks. IEEE/CVF Com-
puter Vision & Pattern Recognition, 1125–
1134. 205, 293, 301
Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov,
D., & Wilson, A. G. (2018). Averaging weights
leads to wider optima and better generaliza-
tion. Uncertainty in Artificial Intelligence,
876–885. 158, 411
Jackson, P. T., Abarghouei, A. A., Bonner, S.,
Breckon, T. P., & Obara, B. (2019). Style aug-
mentation: Data augmentation via style ran-
domization. IEEE Computer Vision and Pat-
tern Recognition Workshops, 10–11. 159
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., & Hin-
ton, G. E. (1991). Adaptive mixtures of local
experts. Neural Computation, 3(1), 79–87. 73
Jacobsen, J.-H., Smeulders, A., & Oyallon, E.
(2018). i-RevNet: Deep invertible networks.
International Conference on Learning Repre-
sentations. 322, 323
Jaini, P., Kobyzev, I., Yu, Y., & Brubaker, M. A.
(2020). Tails of Lipschitz triangular flows. In-
ternational Conference on Machine Learning,
4673–4681. 324
Jaini, P., Selby, K. A., & Yu, Y. (2019). Sum-of-
squares polynomial ow. International Con-
ference on Machine Learning, 3009–3018. 323
Jaitly, N., & Hinton, G. E. (2013). Vocal tract
length perturbation (VTLP) improves speech
recognition. ICML Workshop on Deep Learn-
ing for Audio, Speech and Language. 160
Jarrett, K., Kavukcuoglu, K., Ranzato, M., & Le-
Cun, Y. (2009). What is the best multi-stage
architecture for object recognition? IEEE In-
ternational Conference on Computer Vision,
2146–2153. 37
Jastrzębski, S., Arpit, D., Astrand, O., Kerg,
G. B., Wang, H., Xiong, C., Socher, R., Cho,
K., & Geras, K. J. (2021). Catastrophic Fisher
explosion: Early phase Fisher matrix impacts
generalization. International Conference on
Machine Learning, 4772–4784. 157
Jastrzębski, S., Kenton, Z., Arpit, D., Ballas,
N., Fischer, A., Bengio, Y., & Storkey, A.
(2018). Three factors influencing minima in
SGD. arXiv:1711.04623. 92, 410
Ji, S., Xu, W., Yang, M., & Yu, K. (2012). 3D
convolutional neural networks for human ac-
tion recognition. IEEE Transactions on Pat-
tern Analysis & Machine Intelligence, 35(1),
221–231. 182
Jia, X., De Brabandere, B., Tuytelaars, T., & Gool,
L. V. (2016). Dynamic filter networks. Neural
Information Processing Systems, 29. 183
Jiang, Z., Zheng, Y., Tan, H., Tang, B., & Zhou,
H. (2016). Variational deep embedding: An
unsupervised and generative approach to clus-
tering. International Joint Conference on Ar-
tificial Intelligence, 1965–1972. 344
Jin, C., Netrapalli, P., & Jordan, M. (2020). What
is local optimality in nonconvex-nonconcave
minimax optimization? International Confer-
ence on Machine Learning, 4880–4889. 299
Jin, L., Doshi-Velez, F., Miller, T., Schwartz, L.,
& Schuler, W. (2019). Unsupervised learning
of PCFGs with normalizing flow. Meeting of
the Association for Computational Linguistics,
2442–2452. 322
Jing, L., & Tian, Y. (2020). Self-supervised visual
feature learning with deep neural networks: A
survey. IEEE Transactions on Pattern Analy-
sis & Machine Intelligence, 43(11), 4037–4058.
159
Jobin, A., Ienca, M., & Vayena, E. (2019). The
global landscape of AI ethics guidelines. Na-
ture Machine Intelligence, 1, 389–399. 420
Johnson, G. M. (2022). Are algorithms value-free?
Feminist theoretical virtues in machine learn-
ing. 198. 432
Johnson, R., & Zhang, T. (2013). Accelerat-
ing stochastic gradient descent using predictive
variance reduction. Neural Information Pro-
cessing Systems, 26, 315–323. 91
Jolicoeur-Martineau, A. (2019). The relativistic
discriminator: A key element missing from
standard GAN. International Conference on
Learning Representations. 299
Jurafsky, D., & Martin, J. H. (2000). Speech and
Language Processing, 2nd Edition. Pearson.
233
Kakade, S. M. (2001). A natural policy gradient.
Neural Information Processing Systems, 14,
1531–1538. 397
Kanazawa, A., Sharma, A., & Jacobs, D. (2014).
Locally scale-invariant convolutional neural
networks. Neural Information Processing Sys-
tems Workshop. 183
Kanda, N., Takeda, R., & Obuchi, Y. (2013). Elas-
tic spectral distortion for low resource speech
recognition with deep neural networks. IEEE
Workshop on Automatic Speech Recognition
and Understanding, 309–314. 160
Kaneko, T., & Kameoka, H. (2017). Parallel-data-
free voice conversion using cycle-consistent ad-
versarial networks. arXiv:1711.11293. 299
Kang, G., Dong, X., Zheng, L., & Yang,
Y. (2017). PatchShue regularization.
arXiv:1707.07103. 159
Kanwar, G., Albergo, M. S., Boyda, D., Cranmer,
K., Hackett, D. C., Racaniere, S., Rezende,
D. J., & Shanahan, P. E. (2020). Equivariant
flow-based sampling for lattice gauge theory.
Physical Review Letters, 125(12), 121601. 322
Karras, T., Aila, T., Laine, S., & Lehtinen, J.
(2018). Progressive growing of GANs for im-
proved quality, stability, and variation. Inter-
national Conference on Learning Representa-
tions. 286, 287, 299, 300, 319, 345
Karras, T., Aittala, M., Aila, T., & Laine,
S. (2022). Elucidating the design space of
diusion-based generative models. Neural In-
formation Processing Systems. 369, 370
Karras, T., Aittala, M., Hellsten, J., Laine, S.,
Lehtinen, J., & Aila, T. (2020a). Training gen-
erative adversarial networks with limited data.
Neural Information Processing Systems, 33,
12104–12114. 300
Karras, T., Aittala, M., Laine, S., Härkönen,
E., Hellsten, J., Lehtinen, J., & Aila, T.
(2021). Alias-free generative adversarial net-
works. Neural Information Processing Sys-
tems, 34, 852–863. 300
Karras, T., Laine, S., & Aila, T. (2019). A style-
based generator architecture for generative ad-
versarial networks. IEEE/CVF Computer Vi-
sion & Pattern Recognition, 4401–4410. 299,
300
Karras, T., Laine, S., Aittala, M., Hellsten, J.,
Lehtinen, J., & Aila, T. (2020b). Analyz-
ing and improving the image quality of Style-
GAN. IEEE/CVF Computer Vision & Pattern
Recognition, 8110–8119. 8, 300, 301
Katharopoulos, A., Vyas, A., Pappas, N., &
Fleuret, F. (2020). Transformers are RNNs:
Fast autoregressive transformers with linear
attention. International Conference on Ma-
chine Learning, 5156–5165. 237
Kawaguchi, K., Huang, J., & Kaelbling, L. P.
(2019). Eect of depth and width on local
minima in deep learning. Neural Computation,
31(7), 1462–1498. 405
Ke, G., He, D., & Liu, T.-Y. (2021). Rethinking
positional encoding in language pre-training.
International Conference on Learning Repre-
sentations. 236
Kearnes, S., McCloskey, K., Berndl, M., Pande, V.,
& Riley, P. (2016). Molecular graph convolu-
tions: Moving beyond fingerprints. Journal of
Computer-Aided Molecular Design, 30(8), 595–
608. 264
Kendall, A., & Gal, Y. (2017). What uncertainties
do we need in Bayesian deep learning for com-
puter vision? Neural Information Processing
Systems, 30, 5574–5584. 158
Keskar, N. S., Mudigere, D., Nocedal, J., Smelyan-
skiy, M., & Tang, P. T. P. (2017). On large-
batch training for deep learning: Generaliza-
tion gap and sharp minima. International Con-
ference on Learning Representations. 158, 403,
410, 411
Keskar, N. S., & Socher, R. (2017). Improving
generalization performance by switching from
Adam to SGD. arXiv:1712.07628. 94, 410
Keynes, J. M. (2010). Economic possibilities for
our grandchildren. Essays in Persuasion, 321–
332. Palgrave Macmillan. 430
Khan, S., Naseer, M., Hayat, M., Zamir, S. W.,
Khan, F. S., & Shah, M. (2022). Transformers
in vision: A survey. ACM Computing Surveys,
54(10), 200:1–200:41. 238
Killoran, N., Lee, L. J., Delong, A., Duvenaud, D.,
& Frey, B. J. (2017). Generating and designing
DNA with deep generative models. NIPS 2017
Workshop on Computational Biology. 299
Kim, H., & Mnih, A. (2018). Disentangling by fac-
torising. International Conference on Machine
Learning, 2649–2658. 345, 346
Kim, I., Han, S., Baek, J.-w., Park, S.-J.,
Han, J.-J., & Shin, J. (2021). Quality-
agnostic image recognition via invertible de-
coder. IEEE/CVF Computer Vision & Pat-
tern Recognition, 12257–12266. 322
Kim, S., Lee, S.-g., Song, J., Kim, J., & Yoon, S.
(2018). FloWaveNet: A generative flow for raw
audio. International Conference on Machine
Learning, 3370–3378. 322, 323
Kingma, D., Salimans, T., Poole, B., & Ho, J.
(2021). Variational diusion models. Neural
Information Processing Systems, 34, 21696–
21707. 369
Kingma, D. P., & Ba, J. (2015). Adam: A method
for stochastic optimization. International Con-
ference on Learning Representations. 93, 237
Kingma, D. P., & Dhariwal, P. (2018). Glow: Gen-
erative ow with invertible 1x1 convolutions.
Neural Information Processing Systems, 31,
10236–10245. 319, 322, 323
Kingma, D. P., Salimans, T., Jozefowicz, R., Chen,
X., Sutskever, I., & Welling, M. (2016). Im-
proved variational inference with inverse au-
toregressive ow. Neural Information Process-
ing Systems, 29, 4736–4744. 323, 344
Kingma, D. P., Salimans, T., & Welling, M. (2015).
Variational dropout and the local reparameter-
ization trick. Neural Information Processing
Systems, 28, 2575–2583. 346
Kingma, D. P., & Welling, M. (2014). Auto-
encoding variational Bayes. International
Conference on Learning Representations. 273,
343
Kingma, D. P., Welling, M., et al. (2019). An intro-
duction to variational autoencoders. Founda-
tions and Trends in Machine Learning, 12(4),
307–392. 343
Kipf, T. N., & Welling, M. (2016). Variational
graph auto-encoders. NIPS Bayesian Deep
Learning Workshop. 159, 344
Kipf, T. N., & Welling, M. (2017). Semi-supervised
classication with graph convolutional net-
works. International Conference on Learning
Representations. 262, 263, 264, 265
Kiranyaz, S., Avci, O., Abdeljaber, O., Ince, T.,
Gabbouj, M., & Inman, D. J. (2021). 1D con-
volutional neural networks and applications: A
survey. Mechanical Systems and Signal Pro-
cessing, 151, 107398. 182
Kiranyaz, S., Ince, T., Hamila, R., & Gabbouj,
M. (2015). Convolutional neural networks for
patient-specic ECG classication. Interna-
tional Conference of the IEEE Engineering in
Medicine and Biology Society, vol. 37, 2608–
2611. 182
Kitaev, N., Kaiser, Ł., & Levskaya, A. (2020).
Reformer: The ecient transformer. Inter-
national Conference on Learning Representa-
tions. 237
Kitcher, P. (2011a). The Ethical Project. Harvard
University Press. 432
Kitcher, P. (2011b). Science in a Democratic So-
ciety. Prometheus Books. 432
Klambauer, G., Unterthiner, T., Mayr, A., &
Hochreiter, S. (2017). Self-normalizing neural
networks. Neural Information Processing Sys-
tems, vol. 30, 972–981. 38, 113
Kleinberg, J., Mullainathan, S., & Raghavan, M.
(2017). Inherent trade-os in the fair deter-
mination of risk scores. Innovations in Theo-
retical Computer Science Conference, vol. 67,
1–23. 422
Kleinberg, R., Li, Y., & Yuan, Y. (2018). An al-
ternative view: When does SGD escape local
minima? International Conference on Ma-
chine Learning, 2703–2712. 411
Knight, W. (2018). One of the fathers of AI
is worried about its future. MIT Technol-
ogy Review, Nov 20, 2018. https://www.technologyreview.com/2018/11/17/66372/one-of-the-fathers-of-ai-is-worried-about-its-future/. 430
Kobyzev, I., Prince, S. J., & Brubaker, M. A.
(2020). Normalizing ows: An introduction
and review of current methods. IEEE Trans-
actions on Pattern Analysis & Machine Intel-
ligence, 43(11), 3964–3979. xii, 321, 324
Koenker, R., & Hallock, K. F. (2001). Quantile
regression. Journal of Economic Perspectives,
15(4), 143–156. 73
Köhler, J., Klein, L., & Noé, F. (2020). Equivariant
flows: Exact likelihood generative learning for
symmetric densities. International Conference
on Machine Learning, 5361–5370. 322, 324
Koller, D., & Friedman, N. (2009). Probabilistic
graphical models: Principles and techniques.
MIT Press. 15
Kolomiyets, O., Bethard, S., & Moens, M.-F.
(2011). Model-portability experiments for tex-
tual temporal analysis. Meeting of the Associ-
ation for Computational Linguistics, 271–276.
160
Konda, V., & Tsitsiklis, J. (1999). Actor-critic al-
gorithms. Neural Information Processing Sys-
tems, 12, 1008–1014. 397
Kong, Z., Ping, W., Huang, J., Zhao, K., & Catan-
zaro, B. (2021). DiffWave: A versatile diffusion
model for audio synthesis. International Con-
ference on Learning Representations. 369
Kool, W., van Hoof, H., & Welling, M. (2019). At-
tention, learn to solve routing problems! In-
ternational Conference on Learning Represen-
tations. 396
Kosinski, M., Stillwell, D., & Graepel, T. (2013).
Private traits and attributes are predictable
from digital records of human behavior. Pro-
ceedings of the National Academy of Sciences
of the United States of America, 110(15),
5802–5805. 427
Kratsios, M. (2019). The national artificial in-
telligence research and development strategic
plan: 2019 update. Tech. rep., Networking
and Information Technology Research and De-
velopment. https://www.nitrd.gov/pubs/
National-AI-RD-Strategy-2019.pdf. 430
Krizhevsky, A., & Hinton, G. (2009). Learning
multiple layers of features from tiny images.
Technical Report, University of Toronto. 188
Krizhevsky, A., Sutskever, I., & Hinton, G. E.
(2012). ImageNet classication with deep con-
volutional neural networks. Neural Informa-
tion Processing Systems, 25, 1097–1105. 52,
113, 159, 176, 181
Kruse, J., Detommaso, G., Köthe, U., & Scheichl,
R. (2021). HINT: Hierarchical invertible neural
transport for density estimation and Bayesian
inference. AAAI Conference on Artificial In-
telligence, 8191–8199. 323
Kudo, T. (2018). Subword regularization: Improv-
ing neural network translation models with
multiple subword candidates. Meeting of the
Association for Computational Linguistics,
66–75. 234
Kudo, T., & Richardson, J. (2018). SentencePiece:
A simple and language independent subword
tokenizer and detokenizer for neural text pro-
cessing. Empirical Methods in Natural Lan-
guage Processing, 66–71. 234
Kukačka, J., Golkov, V., & Cremers, D. (2017).
Regularization for deep learning: A taxonomy.
arXiv:1710.10686. 155
Kulikov, I., Miller, A. H., Cho, K., & Weston, J.
(2018). Importance of search and evaluation
strategies in neural dialogue modeling. ACL
International Conference on Natural Language
Generation, 76–87. 235
Kumar, A., Fu, J., Soh, M., Tucker, G., & Levine,
S. (2019a). Stabilizing off-policy Q-learning via
bootstrapping error reduction. Neural Infor-
mation Processing Systems, 32, 11761–11771.
398
Kumar, A., Sattigeri, P., & Balakrishnan, A.
(2018). Variational inference of disentangled
latent concepts from unlabeled observations.
International Conference on Learning Repre-
sentations. 345
Kumar, A., Singh, S. S., Singh, K., & Biswas, B.
(2020a). Link prediction techniques, applica-
tions, and performance: A survey. Physica
A: Statistical Mechanics and its Applications,
553, 124289. 262
Kumar, A., Zhou, A., Tucker, G., & Levine, S.
(2020b). Conservative Q-learning for offline re-
inforcement learning. Neural Information Pro-
cessing Systems, 33, 1179–1191. 398
Kumar, M., Babaeizadeh, M., Erhan, D., Finn, C.,
Levine, S., Dinh, L., & Kingma, D. (2019b).
VideoFlow: A ow-based generative model for
video. ICML Workshop on Invertible Neural
Networks and Normalizing Flows. 322
Kumar, M., Weissenborn, D., & Kalchbrenner,
N. (2021). Colorization transformer. Inter-
national Conference on Learning Representa-
tions. 238
Kurach, K., Lučić, M., Zhai, X., Michalski, M., &
Gelly, S. (2019). A large-scale study on reg-
ularization and normalization in GANs. In-
ternational Conference on Machine Learning,
3581–3590. 299
Kurenkov, A. (2020). A Brief History of
Neural Nets and Deep Learning. https://www.skynettoday.com/overviews/neural-net-history. 37
Kynkäänniemi, T., Karras, T., Laine, S., Lehti-
nen, J., & Aila, T. (2019). Improved precision
and recall metric for assessing generative mod-
els. Neural Information Processing Systems,
32, 3929–3938. 274
LaCroix, T. (2022). The linguistic blind spot
of value-aligned agency, natural and artificial.
arXiv:2207.00868. 421
LaCroix, T. (2023). Articial Intelligence and
the Value-Alignment Problem: A Philosoph-
ical Introduction. https://value-alignment.github.io. 422, 435
LaCroix, T., Geil, A., & O’Connor, C. (2021). The
dynamics of retraction in epistemic networks.
Philosophy of Science, 88(3), 415–438. 432
LaCroix, T., & Mohseni, A. (2022). The tragedy
of the AI commons. Synthese, 200(289). 420
Laont, J.-J., & Martimort, D. (2002). The The-
ory of Incentives: The Principal-Agent Model.
Princeton University Press. 421
Lakshminarayanan, B., Pritzel, A., & Blundell, C.
(2017). Simple and scalable predictive uncer-
tainty estimation using deep ensembles. Neu-
ral Information Processing Systems, 30, 6402–
6413. 158
Lamb, A., Dumoulin, V., & Courville, A. (2016).
Discriminative regularization for generative
models. arXiv:1602.03220. 344
Lample, G., & Charton, F. (2020). Deep learning
for symbolic mathematics. International Con-
ference on Learning Representations. 234
Larsen, A. B. L., Sønderby, S. K., Larochelle, H.,
& Winther, O. (2016). Autoencoding beyond
pixels using a learned similarity metric. In-
ternational Conference on Machine Learning,
1558–1566. 344, 345
Lasseck, M. (2018). Acoustic bird detection with
deep convolutional neural networks. Detec-
tion and Classication of Acoustic Scenes and
Events, 143–147. 160
Lattimore, T., & Szepesvári, C. (2020). Bandit
algorithms. Cambridge University Press. 136
Lawrence, S., Giles, C. L., Tsoi, A. C., & Back,
A. D. (1997). Face recognition: A convolu-
tional neural-network approach. IEEE Trans-
actions on Neural Networks, 8(1), 98–113. 181
LeCun, Y. (1985). Une procedure d’apprentissage
pour reseau a seuil asymmetrique. Proceedings
of Cognitiva, 599–604. 113
LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep
learning. Nature, 521(7553), 436–444. 52
LeCun, Y., Boser, B., Denker, J., Henderson,
D., Howard, R., Hubbard, W., & Jackel, L.
(1989a). Handwritten digit recognition with
a back-propagation network. Neural Informa-
tion Processing Systems, 2, 396–404. 180, 181
LeCun, Y., Boser, B., Denker, J. S., Henderson, D.,
Howard, R. E., Hubbard, W., & Jackel, L. D.
(1989b). Backpropagation applied to hand-
written zip code recognition. Neural Compu-
tation, 1(4), 541–551. 180
LeCun, Y., Bottou, L., Bengio, Y., & Haner,
P. (1998). Gradient-based learning applied
to document recognition. Proceedings of the
IEEE, 86(11), 2278–2324. 159, 181
LeCun, Y., Chopra, S., Hadsell, R., Ranzato, M.,
& Huang, F. (2006). A tutorial on energy-
based learning. Predicting structured data,
1(0). 274
LeCun, Y., Denker, J. S., & Solla, S. A. (1990).
Optimal brain damage. Neural Information
Processing Systems, vol. 3, 598–605. 414
LeCun, Y. A., Bottou, L., Orr, G. B., & Müller,
K.-R. (2012). Ecient backprop. Neural Net-
works: Tricks of the trade, 9–48. Springer. 113,
410
Ledig, C., Theis, L., Huszár, F., Caballero,
J., Cunningham, A., Acosta, A., Aitken,
A., Tejani, A., Totz, J., Wang, Z., et al.
(2017). Photo-realistic single image super-
resolution using a generative adversarial net-
work. IEEE/CVF Computer Vision & Pattern
Recognition, 4681–4690. 294, 301
Lee, J., Lee, I., & Kang, J. (2019). Self-attention
graph pooling. International Conference on
Machine Learning, 3734–3743. 265
Lee, J. B., Rossi, R. A., Kong, X., Kim, S., Koh,
E., & Rao, A. (2018). Higher-order graph con-
volutional networks. arXiv:1809.07697 . 263
Lehman, J., & Stanley, K. O. (2008). Exploiting
open-endedness to solve problems through the
search for novelty. International Conference
on Articial Life, 329–336. 421
Leuner, J. (2019). A replication study: Ma-
chine learning models are capable of pre-
dicting sexual orientation from facial images.
arXiv:1902.10739. 427
Li, C., Chen, C., Carlson, D., & Carin, L. (2016a).
Preconditioned stochastic gradient Langevin
dynamics for deep neural networks. AAAI
Conference on Articial Intelligence, 1788–
1794. 159
Li, C., Farkhoor, H., Liu, R., & Yosinski, J.
(2018a). Measuring the intrinsic dimension
of objective landscapes. International Confer-
ence on Learning Representations. 407, 408
Li, F.-F. (2018). How to make A.I. that’s good
for people. The New York Times, March
7, 2018. https://www.nytimes.com/2018/03/07/opinion/artificial-intelligence-human.html. 430
Li, G., Müller, M., Ghanem, B., & Koltun, V.
(2021a). Training graph neural networks with
1000 layers. International Conference on Ma-
chine Learning, 6437–6449. 266, 322
Li, G., Müller, M., Qian, G., Perez, I. C. D.,
Abualshour, A., Thabet, A. K., & Ghanem,
B. (2021b). DeepGCNs: Making GCNs go as
deep as CNNs. IEEE Transactions on Pattern
Analysis and Machine Intelligence. 266
Li, G., Xiong, C., Thabet, A., & Ghanem, B.
(2020a). DeeperGCN: All you need to train
deeper GCNs. arXiv:2006.07739. 266
Li, H., Kadav, A., Durdanovic, I., Samet, H., &
Graf, H. P. (2017a). Pruning filters for effi-
cient ConvNets. International Conference on
Learning Representations. 414
Li, H., Xu, Z., Taylor, G., Studer, C., & Goldstein,
T. (2018b). Visualizing the loss landscape of
neural nets. Neural Information Processing
Systems, 31, 6391–6401. 201, 202, 407
Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh,
A., & Talwalkar, A. (2017b). Hyperband: A
novel bandit-based approach to hyperparame-
ter optimization. Journal of Machine Learning
Research, 18(1), 6765–6816. 136
Li, L. H., Yatskar, M., Yin, D., Hsieh, C.-J., &
Chang, K.-W. (2019). VisualBERT: A sim-
ple and performant baseline for vision and lan-
guage. arXiv:1908.03557. 238
Li, Q., Han, Z., & Wu, X.-M. (2018c). Deeper
insights into graph convolutional networks for
semi-supervised learning. AAAI Conference
on Articial Intelligence, 3438–3545. 265
Li, S., Zhao, Y., Varma, R., Salpekar, O., No-
ordhuis, P., Li, T., Paszke, A., Smith, J.,
Vaughan, B., Damania, P., & Chintala, S.
(2020b). Pytorch distributed: Experiences on
accelerating data parallel training. Interna-
tional Conference on Very Large Databases.
114
Li, W., Lin, Z., Zhou, K., Qi, L., Wang, Y., & Jia,
J. (2022). MAT: Mask-aware transformer for
large hole image inpainting. IEEE/CVF Com-
puter Vision & Pattern Recognition, 10758–
10768. 238
Li, Y. (2017). Deep reinforcement learning: An
overview. arXiv:1701.07274. 396
Li, Y., Cohn, T., & Baldwin, T. (2017c). Robust
training under linguistic adversity. Meeting of
the Association for Computational Linguistics,
21–27. 160
Li, Y., & Liang, Y. (2018). Learning overparame-
terized neural networks via stochastic gradient
descent on structured data. Neural Informa-
tion Processing Systems, 31, 8168–8177. 407
Li, Y., Tarlow, D., Brockschmidt, M., & Zemel,
R. (2016b). Gated graph sequence neural net-
works. International Conference on Learning
Representations. 262
Li, Y., & Turner, R. E. (2016). Rényi divergence
variational inference. Neural Information Pro-
cessing Systems, 29, 1073–1081. 346
Li, Z., & Arora, S. (2019). An exponential learning
rate schedule for deep learning. International
Conference on Learning Representations. 204
Liang, D., Krishnan, R. G., Hoffman, M. D., &
Jebara, T. (2018). Variational autoencoders
for collaborative filtering. World Wide Web
Conference, 689–698. 344
Liang, J., Zhang, K., Gu, S., Van Gool, L.,
& Timofte, R. (2021). Flow-based ker-
nel prior with application to blind super-
resolution. IEEE/CVF Computer Vision &
Pattern Recognition, 10601–10610. 322
Liang, S., & Srikant, R. (2016). Why deep neural
networks for function approximation? Inter-
national Conference on Learning Representa-
tions. 53, 417
Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N.,
Erez, T., Tassa, Y., Silver, D., & Wierstra,
D. (2016). Continuous control with deep rein-
forcement learning. International Conference
on Learning Representations. 397
Lin, K., Li, D., He, X., Zhang, Z., & Sun, M.-T.
(2017a). Adversarial ranking for language gen-
eration. Neural Information Processing Sys-
tems, 30, 3155–3165. 299
Lin, L.-J. (1992). Self-improving reactive agents
based on reinforcement learning, planning and
teaching. Machine learning, 8, 293–321. 396
Lin, M., Chen, Q., & Yan, S. (2014). Network in
network. International Conference on Learn-
ing Representations. 181
Lin, T., Wang, Y., Liu, X., & Qiu, X. (2022). A
survey of transformers. AI Open, 3, 111–132.
233
Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariha-
ran, B., & Belongie, S. (2017b). Feature pyra-
mid networks for object detection. IEEE Com-
puter Vision & Pattern Recognition, 2117–
2125. 184
Lin, T.-Y., Goyal, P., Girshick, R., He, K., & Dol-
lár, P. (2017c). Focal loss for dense object de-
tection. IEEE/CVF International Conference
on Computer Vision, 2980–2988. 73
Lin, Z., Khetan, A., Fanti, G., & Oh, S. (2018).
PacGAN: The power of two samples in genera-
tive adversarial networks. Neural Information
Processing Systems, 31, 1505–1514. 300
Ling, H., Kreis, K., Li, D., Kim, S. W., Tor-
ralba, A., & Fidler, S. (2021). EditGAN: High-
precision semantic image editing. Neural Infor-
mation Processing Systems, 34, 16331–16345.
302
Lipman, Y., Chen, R. T., Ben-Hamu, H., Nickel,
M., & Le, M. (2022). Flow matching for gen-
erative modeling. arXiv:2210.02747. 369
Lipton, Z. C., & Tripathi, S. (2017). Precise re-
covery of latent vectors from generative adver-
sarial networks. International Conference on
Learning Representations. 301
Liu, G., Reda, F. A., Shih, K. J., Wang, T.-C.,
Tao, A., & Catanzaro, B. (2018a). Image in-
painting for irregular holes using partial con-
volutions. European Conference on Computer
Vision, 85–100. 181
Liu, H., Simonyan, K., & Yang, Y. (2019a).
DARTS: Dierentiable architecture search. In-
ternational Conference on Learning Represen-
tations. 414
Liu, L., Jiang, H., He, P., Chen, W., Liu, X., Gao,
J., & Han, J. (2021a). On the variance of
the adaptive learning rate and beyond. Inter-
national Conference on Learning Representa-
tions. 93
Liu, L., Liu, X., Gao, J., Chen, W., & Han, J.
(2020). Understanding the diculty of training
transformers. Empirical Methods in Natural
Language Processing, 5747–5763. 237, 238
Liu, L., Luo, Y., Shen, X., Sun, M., & Li, B.
(2019b). Beta-dropout: A unied dropout.
IEEE Access, 7, 36140–36153. 158
Liu, P. J., Saleh, M., Pot, E., Goodrich, B., Sep-
assi, R., Kaiser, L., & Shazeer, N. (2018b).
Generating Wikipedia by summarizing long se-
quences. International Conference on Learn-
ing Representations. 237
Liu, X., Zhang, F., Hou, Z., Mian, L., Wang, Z.,
Zhang, J., & Tang, J. (2023a). Self-supervised
learning: Generative or contrastive. IEEE
Transactions on Knowledge and Data Engi-
neering, 35(1), 857–876. 159
Liu, Y., Qin, Z., Anwar, S., Ji, P., Kim, D., Cald-
well, S., & Gedeon, T. (2021b). Invertible de-
noising network: A light solution for real noise
removal. IEEE/CVF Computer Vision & Pat-
tern Recognition, 13365–13374. 322
Liu, Y., Zhang, Y., Wang, Y., Hou, F., Yuan, J.,
Tian, J., Zhang, Y., Shi, Z., Fan, J., & He,
Z. (2023b). A survey of visual transformers.
IEEE Transactions on Neural Networks and
Learning Systems. 238
Liu, Z., Hu, H., Lin, Y., Yao, Z., Xie, Z., Wei,
Y., Ning, J., Cao, Y., Zhang, Z., Dong, L.,
Wei, F., & Guo, B. (2022). Swin trans-
former V2: Scaling up capacity and resolu-
tion. IEEE/CVF Computer Vision & Pattern
Recognition, 12009–12019. 238
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y.,
Zhang, Z., Lin, S., & Guo, B. (2021c). Swin
transformer: Hierarchical vision transformer
using shifted windows. IEEE/CVF Inter-
national Conference on Computer Vision,
10012–10022. 231, 238
Liu, Z., Luo, P., Wang, X., & Tang, X. (2015).
Deep learning face attributes in the wild. IEEE
International Conference on Computer Vision,
3730–3738. 345
Liu, Z., Michaud, E. J., & Tegmark, M. (2023c).
Omnigrok: Grokking beyond algorithmic data.
International Conference on Learning Repre-
sentations. 405, 406, 412, 413
Liu, Z., Sun, M., Zhou, T., Huang, G., & Darrell,
T. (2019c). Rethinking the value of network
pruning. International Conference on Learn-
ing Representations. 235
Livni, R., Shalev-Shwartz, S., & Shamir, O. (2014).
On the computational eciency of training
neural networks. Neural Information Process-
ing Systems, 27, 855–863. 405
Locatello, F., Weissenborn, D., Unterthiner, T.,
Mahendran, A., Heigold, G., Uszkoreit, J.,
Dosovitskiy, A., & Kipf, T. (2020). Object-
centric learning with slot attention. Neural
Information Processing Systems, 33, 11525–
11538. 238
Long, J., Shelhamer, E., & Darrell, T. (2015). Fully
convolutional networks for semantic segmenta-
tion. IEEE/CVF Computer Vision & Pattern
Recognition, 3431–3440. 181
Longino, H. E. (1990). Science as Social Knowl-
edge: Values and Objectivity in Scientific In-
quiry. Princeton University Press. 432
Longino, H. E. (1996). Cognitive and non-cognitive
values in science: Rethinking the dichotomy.
Feminism, Science, and the Philosophy of Sci-
ence, 39–58. 432
Loshchilov, I., & Hutter, F. (2019). Decou-
pled weight decay regularization. International
Conference on Learning Representations. 94,
156
Louizos, C., Welling, M., & Kingma, D. P. (2018).
Learning sparse neural networks through L0
regularization. International Conference on
Learning Representations. 156
Loukas, A. (2020). What graph neural networks
cannot learn: Depth vs width. International
Conference on Learning Representations. 262
Lu, J., Batra, D., Parikh, D., & Lee, S. (2019).
VilBERT: Pretraining task-agnostic visiolin-
guistic representations for vision-and-language
tasks. Neural Information Processing Systems,
32, 13–23. 238
Lu, S.-P., Wang, R., Zhong, T., & Rosin,
P. L. (2021). Large-capacity image steganog-
raphy based on invertible neural networks.
IEEE/CVF Computer Vision & Pattern
Recognition, 10816–10825. 322
Lu, Z., Pu, H., Wang, F., Hu, Z., & Wang, L.
(2017). The expressive power of neural net-
works: A view from the width. Neural Infor-
mation Processing Systems, 30, 6231–6239. 53
Lubana, E. S., Dick, R., & Tanaka, H. (2021).
Beyond BatchNorm: Towards a unied un-
derstanding of normalization in deep learning.
Neural Information Processing Systems, 34,
4778–4791. 204
Lucas, J., Tucker, G., Grosse, R., & Norouzi, M.
(2019a). Understanding posterior collapse in
generative latent variable models. ICLR Work-
shop on Deep Generative Models for Highly
Structured Data. 345
Lucas, J., Tucker, G., Grosse, R. B., & Norouzi, M.
(2019b). Don’t blame the ELBO! A linear VAE
perspective on posterior collapse. Neural In-
formation Processing Systems, 32, 9403–9413.
345
Luccioni, A. S. (2023). The mounting human and
environmental costs of generative AI. Ars Technica, April 12, 2023. https://arstechnica.com/gadgets/2023/04/generative-ai-is-cool-but-lets-not-forget-its-human-and-environmental-costs. 429
Luccioni, A. S., Viguier, S., & Ligozat, A.-L.
(2022). Estimating the carbon footprint of
BLOOM, a 176B parameter language model.
arXiv:2211.02001. 429
Lucic, M., Kurach, K., Michalski, M., Gelly, S., &
Bousquet, O. (2018). Are GANs created equal?
A large-scale study. Neural Information Pro-
cessing Systems, 31, 698–707. 299
Lücke, J., Forster, D., & Dai, Z. (2020). The
evidence lower bound of variational autoen-
coders converges to a sum of three entropies.
arXiv:2010.14860. 346
Luo, C. (2022). Understanding diusion models:
A unied perspective. arXiv:2208.11970. 369
Luo, G., Heide, M., & Uecker, M. (2022).
MRI reconstruction via data driven Markov
chain with joint uncertainty estimation.
arXiv:2202.01479. 369
Luo, J., Xu, Y., Tang, C., & Lv, J. (2017a). Learn-
ing inverse mapping by autoencoder based gen-
erative adversarial nets. Neural Information
Processing Systems, vol. 30, 207–216. 301
Luo, J.-H., Wu, J., & Lin, W. (2017b). ThiNet: A
filter level pruning method for deep neural net-
work compression. IEEE/CVF International
Conference on Computer Vision, 5058–5066.
414
Luo, P., Wang, X., Shao, W., & Peng, Z. (2018).
Towards understanding regularization in batch
normalization. International Conference on
Learning Representations. 205
Luo, S., & Hu, W. (2021). Diusion proba-
bilistic models for 3D point cloud genera-
tion. IEEE/CVF Computer Vision & Pattern
Recognition, 2837–2845. 369
Luong, M.-T., Pham, H., & Manning, C. D. (2015).
Eective approaches to attention-based neu-
ral machine translation. Empirical Methods in
Natural Language Processing, 1412–1421. 235
Luther, K. (2020). Why BatchNorm causes explod-
ing gradients. https://kyleluther.github.io/2020/02/18/BatchNorm-exploding-gradients.html. 203
Ma, Y., & Tang, J. (2021). Deep learning on
graphs. Cambridge University Press. 261
Ma, Y.-A., Chen, T., & Fox, E. (2015). A complete
recipe for stochastic gradient MCMC. Neu-
ral Information Processing Systems, 28, 2917–
2925. 159
Maaløe, L., Sønderby, C. K., Sønderby, S. K., &
Winther, O. (2016). Auxiliary deep generative
models. International Conference on Machine
Learning, 1445–1453. 344, 345
Maas, A. L., Hannun, A. Y., & Ng, A. Y. (2013).
Rectier nonlinearities improve neural network
acoustic models. ICML Workshop on Deep
Learning for Audio, Speech, and Language
Processing. 38
MacKay, D. J. (1995). Ensemble learning and evi-
dence maximization. Neural Information Pro-
cessing Systems, vol. 8, 4083–4090. 159
MacKay, M., Vicol, P., Ba, J., & Grosse, R. B.
(2018). Reversible recurrent neural networks.
Neural Information Processing Systems, 31,
9043–9054. 322
Mackowiak, R., Ardizzone, L., Kothe, U., &
Rother, C. (2021). Generative classiers
as a basis for trustworthy image classica-
tion. IEEE/CVF Computer Vision & Pattern
Recognition, 2971–2981. 322
Madhawa, K., Ishiguro, K., Nakago, K., &
Abe, M. (2019). GraphNVP: An invertible
flow model for generating molecular graphs.
arXiv:1905.11600. 322
Mahendran, A., & Vedaldi, A. (2015). Understand-
ing deep image representations by inverting
them. IEEE/CVF Computer Vision & Pat-
tern Recognition, 5188–5196. 184
Makhzani, A., Shlens, J., Jaitly, N., Goodfellow, I.,
& Frey, B. (2015). Adversarial autoencoders.
arXiv:1511.05644. 345
Mangalam, K., Fan, H., Li, Y., Wu, C.-Y., Xiong,
B., Feichtenhofer, C., & Malik, J. (2022). Re-
versible vision transformers. IEEE/CVF Com-
puter Vision & Pattern Recognition, 10830–
10840. 322
Manning, C., & Schutze, H. (1999). Foundations
of statistical natural language processing. MIT
Press. 233
Manyika, J., Lund, S., Chui, M., Bughin, J., Woet-
zel, J., Batra, P., Ko, R., & Sanghvi, S. (2017).
Jobs Lost, Jobs Gained: Workforce Transi-
tions in a Time of Automation. McKinsey
Global Institute. 429
Manyika, J., & Sneader, K. (2018). AI, automa-
tion, and the future of work: Ten things to
solve for. McKinsey Global Institute. 429
Mao, Q., Lee, H.-Y., Tseng, H.-Y., Ma, S., &
Yang, M.-H. (2019). Mode seeking generative
adversarial networks for diverse image synthe-
sis. IEEE/CVF Computer Vision & Pattern
Recognition, 1429–1437. 300
Mao, X., Li, Q., Xie, H., Lau, R. Y., Wang, Z.,
& Paul Smolley, S. (2017). Least squares gen-
erative adversarial networks. IEEE/CVF In-
ternational Conference on Computer Vision,
2794–2802. 299
Marchesi, M. (2017). Megapixel size image cre-
ation using generative adversarial networks.
arXiv:1706.00082. 299
Martin, G. L. (1993). Centered-object integrated
segmentation and recognition of overlapping
handprinted characters. Neural Computation,
5(3), 419–429. 181
Masci, J., Boscaini, D., Bronstein, M., & Van-
dergheynst, P. (2015). Geodesic convolutional
neural networks on Riemannian manifolds.
IEEE International Conference on Computer
Vision Workshop, 832–840. 265
Masrani, V., Le, T. A., & Wood, F. (2019). The
thermodynamic variational objective. Neural
Information Processing Systems, 32, 11521–
11530. 346
Mathieu, E., Rainforth, T., Siddharth, N., &
Teh, Y. W. (2019). Disentangling disentan-
glement in variational autoencoders. Interna-
tional Conference on Machine Learning, 4402–
4412. 346
Matsakis, L. (2017). A frightening AI can deter-
mine whether a person is gay with 91 per-
cent accuracy. Vice, Sept 8, 2017. https://www.vice.com/en/article/a33xb4/a-frightening-ai-can-determine-a-persons-sexuality-with-91-accuracy. 430
Maturana, D., & Scherer, S. (2015). VoxNet: A
3D convolutional neural network for real-time
object recognition. IEEE/RSJ International
Conference on Intelligent Robots and Systems,
922–928. 182
Mayson, S. G. (2018). Bias in bias out. Yale Law
Journal, 128, 2122–2473. 422
Mazoure, B., Doan, T., Durand, A., Pineau, J., &
Hjelm, R. D. (2020). Leveraging exploration
in o-policy algorithms via normalizing ows.
Conference on Robot Learning, 430–444. 322
Mazyavkina, N., Sviridov, S., Ivanov, S., & Bur-
naev, E. (2021). Reinforcement learning for
combinatorial optimization: A survey. Com-
puters & Operations Research, 134, 105400.
396
McCoy, R. T., Pavlick, E., & Linzen, T. (2019).
Right for the wrong reasons: Diagnosing syn-
tactic heuristics in natural language inference.
Meeting of the Association for Computational
Linguistics, 3428–3448. 234
McCulloch, W. S., & Pitts, W. (1943). A logi-
cal calculus of the ideas immanent in nervous
activity. The Bulletin of Mathematical Bio-
physics, 5(4), 115–133. 37
McNamara, A., Smith, J., & Murphy-Hill, E.
(2018). Does ACM’s code of ethics change
ethical decision making in software develop-
ment? ACM Joint Meeting on European Soft-
ware Engineering Conference and Symposium
on the Foundations of Software Engineering,
729–733. 420
Mehrabi, N., Morstatter, F., Saxena, N., Lerman,
K., & Galstyan, A. (2022). A survey on bias
and fairness in machine learning. ACM Com-
puting Surveys, 54(6), 1–35. 423
Mei, J., Chung, W., Thomas, V., Dai, B.,
Szepesvári, C., & Schuurmans, D. (2022). The
role of baselines in policy gradient optimiza-
tion. Neural Information Processing Systems,
vol. 35, 17818–17830. 397
Meng, C., Song, Y., Song, J., Wu, J., Zhu, J.-Y.,
& Ermon, S. (2021). SDEdit: Image synthesis
and editing with stochastic dierential equa-
tions. International Conference on Learning
Representations. 369
Menon, S., Damian, A., Hu, S., Ravi, N., & Rudin,
C. (2020). PULSE: self-supervised photo up-
sampling via latent space exploration of gener-
ative models. IEEE/CVF Computer Vision &
Pattern Recognition, 2434–2442. 422
Metcalf, J., Keller, E. F., & Boyd, D. (2016).
Perspectives on big data, ethics, and society.
Council for Big Data, Ethics, and Society.
https://bdes.datasociety.net/council-output/perspectives-on-big-data-ethics-and-society/. 430
Metz, L., Poole, B., Pfau, D., & Sohl-Dickstein,
J. (2017). Unrolled generative adversarial net-
works. International Conference on Learning
Representations. 299
Mézard, M., & Mora, T. (2009). Constraint sat-
isfaction problems and neural networks: A
statistical physics perspective. Journal of
Physiology-Paris, 103(1-2), 107–113. 94
Miceli, M., Posada, J., & Yang, T. (2022). Study-
ing up machine learning data: Why talk about
bias when we mean power? Proceedings of the
ACM on Human-Computer Interaction, 6. 423
Milletari, F., Navab, N., & Ahmadi, S.-A. (2016).
V-Net: Fully convolutional neural networks for
volumetric medical image segmentation. Inter-
national Conference on 3D Vision, 565–571.
205
Min, J., McCoy, R. T., Das, D., Pitler, E., &
Linzen, T. (2020). Syntactic data augmenta-
tion increases robustness to inference heuris-
tics. Meeting of the Association for Computa-
tional Linguistics, 2339–2352. 160
Minaee, S., Boykov, Y. Y., Porikli, F., Plaza, A. J.,
Kehtarnavaz, N., & Terzopoulos, D. (2021).
Image segmentation using deep learning: A
survey. IEEE Transactions on Pattern Analy-
sis & Machine Intelligence, 44(7), 3523–3542.
184
Minsky, M., & Papert, S. A. (1969). Perceptrons:
An introduction to computational geometry.
MIT Press. 37, 233
Mireshghallah, F., Taram, M., Vepakomma, P.,
Singh, A., Raskar, R., & Esmaeilzadeh, H.
(2020). Privacy in deep learning: A survey.
arXiv:2004.12254. 428
Mirza, M., & Osindero, S. (2014). Conditional gen-
erative adversarial nets. arXiv:1411.1784. 301
Mishkin, D., & Matas, J. (2016). All you need
is a good init. International Conference on
Learning Representations. 113
Mitchell, M., Forrest, S., & Holland, J. H. (1992).
The royal road for genetic algorithms: Fitness
landscapes and GA performance. European
Conference on Articial Life. 421
Mitchell, S., Potash, E., Barocas, S., D’Amour,
A., & Lum, K. (2021). Algorithmic fairness:
Choices, assumptions, and denitions. Annual
Review of Statistics and Its Application, 8,
141–163. 422
Miyato, T., Kataoka, T., Koyama, M., & Yoshida,
Y. (2018). Spectral normalization for genera-
tive adversarial networks. International Con-
ference on Learning Representations. 299
Miyato, T., & Koyama, M. (2018). cGANs with
projection discriminator. International Con-
ference on Learning Representations. 301
Mnih, V., Badia, A. P., Mirza, M., Graves,
A., Lillicrap, T., Harley, T., Silver, D., &
Kavukcuoglu, K. (2016). Asynchronous meth-
ods for deep reinforcement learning. Interna-
tional Conference on Machine Learning, 1928–
1937. 398
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu,
A. A., Veness, J., Bellemare, M. G., Graves,
A., Riedmiller, M., Fidjeland, A. K., Ostro-
vski, G., et al. (2015). Human-level control
through deep reinforcement learning. Nature,
518(7540), 529–533. 396
Moerland, T. M., Broekens, J., Plaat, A., Jonker,
C. M., et al. (2023). Model-based reinforce-
ment learning: A survey. Foundations and
Trends in Machine Learning, 16(1), 1–118. 398
Mogren, O. (2016). C-RNN-GAN: Continuous re-
current neural networks with adversarial train-
ing. NIPS 2016 Constructive Machine Learn-
ing Workshop. 299
Molnar, C. (2022). Interpretable Machine Learn-
ing: A Guide for Making Black Box Models
Explainable. https://christophm.github.io/
interpretable-ml-book. 425
Monti, F., Boscaini, D., Masci, J., Rodola, E., Svo-
boda, J., & Bronstein, M. M. (2017). Geomet-
ric deep learning on graphs and manifolds us-
ing mixture model CNNs. IEEE/CVF Com-
puter Vision & Pattern Recognition, 5115–
5124. 263, 265
Monti, F., Shchur, O., Bojchevski, A., Litany,
O., Günnemann, S., & Bronstein, M. M.
(2018). Dual-primal graph convolutional net-
works. arXiv:1806.00770. 264
Montúfar, G. (2017). Notes on the number of linear
regions of deep neural networks. 52, 53
Montúfar, G. F., Pascanu, R., Cho, K., & Bengio,
Y. (2014). On the number of linear regions
of deep neural networks. Neural Information
Processing Systems, 27, 2924–2932. 52, 53
Moor, J. (2006). The nature, importance, and difficulty
of machine ethics. IEEE Intelligent Systems,
21(4), 18–21. 424
Moore, A., & Himma, K. (2022). Intellectual Prop-
erty. The Stanford Encyclopedia of Philoso-
phy. 428
Moreno-Torres, J. G., Raeder, T., Alaiz-Rodríguez,
R., Chawla, N. V., & Herrera, F. (2012). A
unifying view on dataset shift in classification.
Pattern Recognition, 45(1), 521–530. 135
Morimura, T., Sugiyama, M., Kashima, H.,
Hachiya, H., & Tanaka, T. (2010). Nonpara-
metric return distribution approximation for
reinforcement learning. International Confer-
ence on Machine Learning, 799–806. 397
Müller, R., Kornblith, S., & Hinton, G. E. (2019a).
When does label smoothing help? Neural In-
formation Processing Systems, 32, 4696–4705.
158
Müller, T., McWilliams, B., Rousselle, F., Gross,
M., & Novák, J. (2019b). Neural importance
sampling. ACM Transactions on Graphics
(TOG), 38(5), 1–19. 322, 323
Mun, S., Shon, S., Kim, W., Han, D. K., & Ko,
H. (2017). Deep neural network based learn-
ing and transferring mid-level audio features
for acoustic scene classification. IEEE International
Conference on Acoustics, Speech and
Signal Processing, 796–800. 160
Murphy, K. P. (2022). Probabilistic machine learn-
ing: An introduction. MIT Press. 15
Murphy, K. P. (2023). Probabilistic machine learn-
ing: Advanced topics. MIT Press. 15
Murphy, R. L., Srinivasan, B., Rao, V., & Ribeiro,
B. (2018). Janossy pooling: Learning deep
permutation-invariant functions for variable-
size inputs. International Conference on
Learning Representations. 263
Murty, K. G., & Kabadi, S. N. (1987). Some
NP-complete problems in quadratic and non-
linear programming. Mathematical Program-
ming, 39(2), 117–129. 401
Mutlu, E. C., Oghaz, T., Rajabi, A., & Garibay,
I. (2020). Review on learning and extract-
ing graph features for link prediction. Ma-
chine Learning and Knowledge Extraction,
2(4), 672–704. 262
Nair, V., & Hinton, G. E. (2010). Rectified linear
units improve restricted Boltzmann machines.
International Conference on Machine Learn-
ing, 807–814. 37
Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T.,
Barak, B., & Sutskever, I. (2021). Deep double
descent: Where bigger models and more data
hurt. Journal of Statistical Mechanics: Theory
and Experiment, 2021(12), 124003. 130, 134
Narang, S., Chung, H. W., Tay, Y., Fedus, W.,
Fevry, T., Matena, M., Malkan, K., Fiedel, N.,
Shazeer, N., Lan, Z., et al. (2021). Do transformer
modifications transfer across implementations
and applications? Empirical Methods
in Natural Language Processing, 5758–5773.
233
Narayanan, A., & Shmatikov, V. (2008). Ro-
bust de-anonymization of large sparse datasets.
IEEE Symposium on Security and Privacy,
111–125. 428
Narayanan, D., Phanishayee, A., Shi, K., Chen,
X., & Zaharia, M. (2021a). Memory-efficient
pipeline-parallel DNN training. International
Conference on Machine Learning, 7937–7947.
114
Narayanan, D., Shoeybi, M., Casper, J., LeGres-
ley, P., Patwary, M., Korthikanti, V., Vain-
brand, D., Kashinkunti, P., Bernauer, J.,
Catanzaro, B., et al. (2021b). Efficient large-scale
language model training on GPU clusters
using Megatron-LM. International Conference
for High Performance Computing, Network-
ing, Storage and Analysis, 1–15. 114
Nash, C., Menick, J., Dieleman, S., & Battaglia,
P. W. (2021). Generating images with sparse
representations. International Conference on
Machine Learning, 7958–7968. 238, 274
Neal, R. M. (1995). Bayesian learning for neural
networks. Springer. 159
Neimark, D., Bar, O., Zohar, M., & Assel-
mann, D. (2021). Video transformer net-
work. IEEE/CVF International Conference
on Computer Vision, 3163–3172. 238
Nesterov, Y. E. (1983). A method for solving
the convex programming problem with convergence
rate O(1/k²). Doklady Akademii Nauk SSSR,
vol. 269, 543–547. 93
Newell, A., Yang, K., & Deng, J. (2016). Stacked
hourglass networks for human pose estima-
tion. European Conference on Computer Vi-
sion, 483–499. 200, 205
Neyshabur, B., Bhojanapalli, S., McAllester, D., &
Srebro, N. (2017). Exploring generalization in
deep learning. Neural Information Processing
Systems, 30, 5947–5956. 134, 412
Neyshabur, B., Bhojanapalli, S., & Srebro,
N. (2018). A PAC-Bayesian approach to
spectrally-normalized margin bounds for neu-
ral networks. International Conference on
Learning Representations. 156
Ng, N. H., Gabriel, R. A., McAuley, J., Elkan, C.,
& Lipton, Z. C. (2017). Predicting surgery du-
ration with neural heteroscedastic regression.
PMLR Machine Learning for Healthcare Con-
ference, 100–111. 74
Nguyen, Q., & Hein, M. (2017). The loss surface of
deep and wide neural networks. International
Conference on Machine Learning, 2603–2612.
405
Nguyen, Q., & Hein, M. (2018). Optimization land-
scape and expressivity of deep CNNs. Interna-
tional Conference on Machine Learning, 3730–
3739. 405
Nichol, A. Q., & Dhariwal, P. (2021). Improved
denoising diusion probabilistic models. In-
ternational Conference on Machine Learning,
8162–8171. 369
Nichol, A. Q., Dhariwal, P., Ramesh, A., Shyam,
P., Mishkin, P., McGrew, B., Sutskever, I., &
Chen, M. (2022). GLIDE: towards photorealistic
image generation and editing with text-guided
diffusion models. International Conference
on Machine Learning, 16784–16804. 369,
370
Nie, W., Guo, B., Huang, Y., Xiao, C., Vahdat, A.,
& Anandkumar, A. (2022). Diffusion models
for adversarial purification. International Conference
on Machine Learning, 16805–16827.
369
Nix, D. A., & Weigend, A. S. (1994). Estimating
the mean and variance of the target probability
distribution. IEEE International Conference
on Neural Networks, 55–60. 73
Noble, S. (2018). Algorithms of Oppression. New
York: NYU Press. 433
Noci, L., Roth, K., Bachmann, G., Nowozin, S., &
Hofmann, T. (2021). Disentangling the roles of
curation, data-augmentation and the prior in
the cold posterior effect. Neural Information
Processing Systems, 34, 12738–12748. 159
Noé, F., Olsson, S., Köhler, J., & Wu, H. (2019).
Boltzmann generators: Sampling equilibrium
states of many-body systems with deep learn-
ing. Science, 365(6457). 322
Noh, H., Hong, S., & Han, B. (2015). Learning
deconvolution network for semantic segmenta-
tion. IEEE International Conference on Com-
puter Vision, 1520–1528. 6, 179, 180, 184
Noothigattu, R., Gaikwad, S. N., Awad, E.,
Dsouza, S., Rahwan, I., Ravikumar, P., &
Procaccia, A. D. (2018). A voting-based system
for ethical decision making. AAAI Conference
on Artificial Intelligence, 1587–1594. 424
Noroozi, M., & Favaro, P. (2016). Unsupervised
learning of visual representations by solving
jigsaw puzzles. European Conference on Com-
puter Vision, 69–84. 159
Nowozin, S., Cseke, B., & Tomioka, R. (2016). f-
GAN: Training generative neural samplers us-
ing variational divergence minimization. Neu-
ral Information Processing Systems, 29, 271–
279. 299
Nye, M., & Saxe, A. (2018). Are efficient deep representations
learnable? International Conference
on Learning Representations (Workshop).
417
O’Connor, C., & Bruner, J. (2019). Dynamics and
diversity in epistemic communities. Erkennt-
nis, 84, 101–119. 433
Odena, A. (2019). Open questions about gen-
erative adversarial networks. Distill, https:
//distill.pub/2019/gan-open-problems. 299
Odena, A., Dumoulin, V., & Olah, C. (2016). Deconvolution
and checkerboard artifacts. Distill,
https://distill.pub/2016/deconv-checkerboard/.
181
Odena, A., Olah, C., & Shlens, J. (2017). Conditional
image synthesis with auxiliary classifier
GANs. International Conference on Machine
Learning, 2642–2651. 290, 301
O’Neil, C. (2016). Weapons of Math Destruction.
Crown. 420, 422
Oono, K., & Suzuki, T. (2019). Graph neural net-
works exponentially lose expressive power for
node classication. International Conference
on Learning Representations. 265
Orhan, A. E., & Pitkow, X. (2017). Skip con-
nections eliminate singularities. International
Conference on Learning Representations. 202
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wain-
wright, C., Mishkin, P., Zhang, C., Agarwal,
S., Slama, K., Ray, A., et al. (2022). Training
language models to follow instructions with hu-
man feedback. Neural Information Processing
Systems, 35, 27730–27744. 398
Papamakarios, G., Nalisnick, E. T., Rezende,
D. J., Mohamed, S., & Lakshminarayanan,
B. (2021). Normalizing flows for probabilistic
modeling and inference. Journal of Machine
Learning Research, 22(57), 1–64. 321
Papamakarios, G., Pavlakou, T., & Murray, I.
(2017). Masked autoregressive flow for density
estimation. Neural Information Processing
Systems, 30, 2338–2347. 323
Park, D., Hoshi, Y., & Kemp, C. C. (2018). A mul-
timodal anomaly detector for robot-assisted
feeding using an LSTM-based variational au-
toencoder. IEEE Robotics and Automation
Letters, 3(3), 1544–1551. 344
Park, D. S., Chan, W., Zhang, Y., Chiu, C.-C.,
Zoph, B., Cubuk, E. D., & Le, Q. V. (2019).
SpecAugment: A simple data augmentation
method for automatic speech recognition. IN-
TERSPEECH. 160
Park, S., & Kwak, N. (2016). Analysis on the
dropout eect in convolutional neural net-
works. Asian Conference on Computer Vision,
189–204. 183
Park, S.-W., Ko, J.-S., Huh, J.-H., & Kim, J.-C.
(2021). Review on generative adversarial net-
works: Focusing on computer vision and its
applications. Electronics, 10(10), 1216. 299
Parker, D. B. (1985). Learning-logic: Casting the
cortex of the human brain in silicon. Alfred P.
Sloan School of Management, MIT. 113
Parmar, N., Ramachandran, P., Vaswani, A.,
Bello, I., Levskaya, A., & Shlens, J. (2019).
Stand-alone self-attention in vision models.
Neural Information Processing Systems, 32,
68–80. 238
Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, L.,
Shazeer, N., Ku, A., & Tran, D. (2018). Im-
age transformer. International Conference on
Machine Learning, 4055–4064. 238
Pascanu, R., Dauphin, Y. N., Ganguli, S., & Ben-
gio, Y. (2014). On the saddle point problem
for non-convex optimization. arXiv:1405.4604.
405
Pascanu, R., Montúfar, G., & Bengio, Y. (2013).
On the number of response regions of deep feed
forward networks with piece-wise linear activa-
tions. arXiv:1312.6098. 53
Paschalidou, D., Katharopoulos, A., Geiger, A., &
Fidler, S. (2021). Neural parts: Learning ex-
pressive 3D shape abstractions with invertible
neural networks. IEEE/CVF Computer Vision
& Pattern Recognition, 3204–3215. 322
Patashnik, O., Wu, Z., Shechtman, E., Cohen-
Or, D., & Lischinski, D. (2021). StyleCLIP:
Text-driven manipulation of StyleGAN im-
agery. IEEE/CVF International Conference
on Computer Vision, 2085–2094. 300
Pateria, S., Subagdja, B., Tan, A.-h., & Quek, C.
(2021). Hierarchical reinforcement learning: A
comprehensive survey. ACM Computing Sur-
veys, 54(5), 1–35. 398
Pathak, D., Krahenbuhl, P., Donahue, J., Dar-
rell, T., & Efros, A. A. (2016). Con-
text encoders: Feature learning by inpaint-
ing. IEEE/CVF Computer Vision & Pattern
Recognition, 2536–2544. 159
Patrick, M., Campbell, D., Asano, Y., Misra, I.,
Metze, F., Feichtenhofer, C., Vedaldi, A., &
Henriques, J. F. (2021). Keeping your eye on
the ball: Trajectory attention in video trans-
formers. Neural Information Processing Sys-
tems, 34, 12493–12506. 238
Peluchetti, S., & Favaro, S. (2020). Infinitely deep
neural networks as diffusion processes. International
Conference on Artificial Intelligence
and Statistics, 1126–1136. 324
Peng, C., Guo, P., Zhou, S. K., Patel, V., & Chel-
lappa, R. (2022). Towards performant and reliable
undersampled MR reconstruction via diffusion
model sampling. Medical Image Computing
and Computer Assisted Intervention,
13436, 623–633. 369
Pennington, J., & Bahri, Y. (2017). Geometry of
neural network loss surfaces via random matrix
theory. International Conference on Machine
Learning, 2798–2806. 405
Perarnau, G., Van De Weijer, J., Raducanu, B.,
& Álvarez, J. M. (2016). Invertible conditional
GANs for image editing. NIPS 2016 Workshop
on Adversarial Training. 301
Pereyra, G., Tucker, G., Chorowski, J., Kaiser, Ł.,
& Hinton, G. (2017). Regularizing neural networks
by penalizing confident output distributions.
International Conference on Learning
Representations Workshop. 158
Peters, J., & Schaal, S. (2008). Reinforcement
learning of motor skills with policy gradients.
Neural Networks, 21(4), 682–697. 397
Peyré, G., Cuturi, M., et al. (2019). Computa-
tional optimal transport with applications to
data science. Foundations and Trends in Ma-
chine Learning, 11(5-6), 355–607. 299
Pezeshki, M., Mitra, A., Bengio, Y., & Lajoie, G.
(2022). Multi-scale feature learning dynamics:
Insights for double descent. International Con-
ference on Machine Learning, 17669–17690.
134
Pham, T., Tran, T., Phung, D., & Venkatesh, S.
(2017). Column networks for collective classification.
AAAI Conference on Artificial Intelligence,
2485–2491. 263
Phuong, M., & Hutter, M. (2022). Formal al-
gorithms for transformers. Technical Report,
DeepMind. 233
Pieters, M., & Wiering, M. (2018). Com-
paring generative adversarial network tech-
niques for image creation and modication.
arXiv:1803.09093. 299
Pintea, S. L., Tömen, N., Goes, S. F., Loog, M.,
& van Gemert, J. C. (2021). Resolution learn-
ing in deep convolutional networks using scale-
space theory. IEEE Transactions on Image
Processing, 30, 8342–8353. 183
Poggio, T., Mhaskar, H., Rosasco, L., Miranda, B.,
& Liao, Q. (2017). Why and when can deep-
but not shallow-networks avoid the curse of di-
mensionality: A review. International Jour-
nal of Automation and Computing, 14(5), 503–
519. 53
Polyak, B. T. (1964). Some methods of speeding up
the convergence of iteration methods. USSR
Computational Mathematics and Mathemati-
cal Physics, 4(5), 1–17. 92
Poole, B., Jain, A., Barron, J. T., & Mildenhall,
B. (2023). DreamFusion: Text-to-3D using 2D
diusion. International Conference on Learn-
ing Representations. 369
Power, A., Burda, Y., Edwards, H., Babuschkin,
I., & Misra, V. (2022). Grokking: Generalization
beyond overfitting on small algorithmic
datasets. arXiv:2201.02177. 412
Prenger, R., Valle, R., & Catanzaro, B. (2019).
Waveglow: A ow-based generative network
for speech synthesis. IEEE International Con-
ference on Acoustics, Speech and Signal Pro-
cessing, 3617–3621. 322, 323
Prince, S. J. D. (2012). Computer vision: Models,
learning, and inference. Cambridge University
Press. 15, 159
Prince, S. J. D. (2021a). Transformers II: Exten-
sions. https://www.borealisai.com/en/blog/
tutorial-16-transformers-ii-extensions/.
236, 237
Prince, S. J. D. (2021b). Transformers III: Train-
ing. https://www.borealisai.com/en/blog/
tutorial-17-transformers-iii-training/.
238
Prince, S. J. D. (2022). Explainability I: local post-hoc
explanations. https://www.borealisai.com/
research-blogs/explainability-i-local-post-hoc-explanations/. 426
Prokudin, S., Gehler, P., & Nowozin, S. (2018).
Deep directional statistics: Pose estimation
with uncertainty quantification. European
Conference on Computer Vision, 534–551. 74
Provilkov, I., Emelianenko, D., & Voita, E. (2020).
BPE-Dropout: Simple and effective subword
regularization. Meeting of the Association for
Computational Linguistics, 1882–1892. 234
Qi, G.-J. (2020). Loss-sensitive generative adver-
sarial networks on Lipschitz densities. Inter-
national Journal of Computer Vision, 128(5),
1118–1140. 299
Qi, J., Du, J., Siniscalchi, S. M., Ma, X., & Lee,
C.-H. (2020). On mean absolute error for deep
neural network based vector-to-vector regression.
IEEE Signal Processing Letters, 27,
1485–1489. 73
Qin, Z., Yu, F., Liu, C., & Chen, X. (2018). How
convolutional neural network see the world -
A survey of convolutional neural network visualization
methods. arXiv:1804.11191. 184
Qiu, S., Xu, B., Zhang, J., Wang, Y., Shen,
X., De Melo, G., Long, C., & Li, X. (2020).
EasyAug: An automatic textual data augmentation
platform for classification tasks. Companion
Proceedings of the Web Conference
2020, 249–252. 160
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A.,
Goh, G., Agarwal, S., Sastry, G., Askell, A.,
Mishkin, P., Clark, J., et al. (2021). Learning
transferable visual models from natural lan-
guage supervision. International Conference
on Machine Learning, 8748–8763. 238, 370
Radford, A., Metz, L., & Chintala, S. (2015). Un-
supervised representation learning with deep
convolutional generative adversarial networks.
International Conference on Learning Repre-
sentations. 280, 299
Radford, A., Wu, J., Child, R., Luan, D., Amodei,
D., Sutskever, I., et al. (2019). Language mod-
els are unsupervised multitask learners. Ope-
nAI Blog, 1(8), 9. 159, 234
Rae, J. W., Borgeaud, S., Cai, T., Millican,
K., Homann, J., Song, F., Aslanides, J.,
Henderson, S., Ring, R., Young, S., et al.
(2021). Scaling language models: Meth-
ods, analysis & insights from training Gopher.
arXiv:2112.11446. 234
Rael, C., Shazeer, N., Roberts, A., Lee, K.,
Narang, S., Matena, M., Zhou, Y., Li, W.,
Liu, P. J., et al. (2020). Exploring the limits
of transfer learning with a unified text-to-text
transformer. Journal of Machine Learning Re-
search, 21(140), 1–67. 236
Raji, I. D., & Buolamwini, J. (2019). Actionable
auditing: Investigating the impact of publicly
naming biased performance results of commer-
cial AI products. AAAI/ACM Conference on
AI, Ethics, and Society, 429–435. 423
Raji, I. D., & Fried, G. (2020). About face: A
survey of facial recognition evaluation. AAAI
Workshop on AI Evaluation. 427
Raji, I. D., Kumar, I. E., Horowitz, A., & Selbst, A.
(2022). The fallacy of AI functionality. ACM
Conference on Fairness, Accountability, and
Transparency, 959–972. 423, 427
Rajpurkar, P., Chen, E., Banerjee, O., & Topol,
E. J. (2022). AI in health and medicine. Nature
Medicine, 28(1), 31–38. 420
Rajpurkar, P., Zhang, J., Lopyrev, K., & Liang, P.
(2016). SQuAD: 100,000+ questions for ma-
chine comprehension of text. Empirical Meth-
ods in Natural Language Processing, 2383–
2392. 234
Ramachandran, P., Zoph, B., & Le, Q. V.
(2017). Searching for activation functions.
arXiv:1710.05941. 38
Ramesh, A., Dhariwal, P., Nichol, A., Chu,
C., & Chen, M. (2022). Hierarchical text-
conditional image generation with CLIP la-
tents. arXiv:2204.06125. 10, 11, 238, 369, 370
Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss,
C., Radford, A., Chen, M., & Sutskever, I.
(2021). Zero-shot text-to-image generation. In-
ternational Conference on Machine Learning,
8821–8831. 238, 370
Ramsauer, H., Schäfl, B., Lehner, J., Seidl, P.,
Widrich, M., Adler, T., Gruber, L., Holzleit-
ner, M., Pavlović, M., Sandve, G. K., et al.
(2021). Hopeld networks is all you need. In-
ternational Conference on Learning Represen-
tations. 236
Ranganath, R., Tran, D., & Blei, D. (2016). Hierar-
chical variational models. International Con-
ference on Machine Learning, 324–333. 345
Ravanbakhsh, S., Lanusse, F., Mandelbaum, R.,
Schneider, J., & Poczos, B. (2017). Enabling
dark energy science with deep generative mod-
els of galaxy images. AAAI Conference on Artificial
Intelligence, 1488–1494. 344
Rawat, W., & Wang, Z. (2017). Deep convolutional
neural networks for image classification:
A comprehensive review. Neural Computation,
29(9), 2352–2449. 181
Rawls, J. (1971). A Theory of Justice. Belknap
Press. 430
Razavi, A., Oord, A. v. d., Poole, B., & Vinyals,
O. (2019a). Preventing posterior collapse
with delta-VAEs. International Conference on
Learning Representations. 345
Razavi, A., Van den Oord, A., & Vinyals, O.
(2019b). Generating diverse high-fidelity images
with VQ-VAE-2. Neural Information Processing
Systems, 32, 14837–14847. 344, 345
Recht, B., Re, C., Wright, S., & Niu, F. (2011).
Hogwild!: A lock-free approach to paralleliz-
ing stochastic gradient descent. Neural Infor-
mation Processing Systems, 24, 693–701. 114
Reddi, S. J., Kale, S., & Kumar, S. (2018). On
the convergence of Adam and beyond. Inter-
national Conference on Learning Representa-
tions. 93
Redmon, J., Divvala, S., Girshick, R., & Farhadi,
A. (2016). You only look once: Unified, real-time
object detection. IEEE/CVF Computer
Vision & Pattern Recognition, 779–788. 178,
184
Reed, S., Akata, Z., Yan, X., Logeswaran, L.,
Schiele, B., & Lee, H. (2016a). Generative ad-
versarial text to image synthesis. International
Conference on Machine Learning, 1060–1069.
301
Reed, S. E., Akata, Z., Mohan, S., Tenka, S.,
Schiele, B., & Lee, H. (2016b). Learning what
and where to draw. Neural Information Pro-
cessing Systems, 29, 217–225. 301
Reiss, J., & Sprenger, J. (2017). Scientific Objectivity.
The Stanford Encyclopedia of Philosophy.
431
Ren, S., He, K., Girshick, R., & Sun, J. (2015).
Faster R-CNN: Towards real-time object de-
tection with region proposal networks. Neural
Information Processing Systems, 28. 183
Rezende, D. J., & Mohamed, S. (2015). Variational
inference with normalizing flows. International
Conference on Machine Learning, 1530–1538.
273, 321, 322, 344
Rezende, D. J., Mohamed, S., & Wierstra, D.
(2014). Stochastic backpropagation and ap-
proximate inference in deep generative models.
International Conference on Machine Learn-
ing, 1278–1286. 346
Rezende, D. J., Racanière, S., Higgins, I., & Toth,
P. (2019). Equivariant Hamiltonian flows.
arXiv:1909.13739. 324
Rezende Jimenez, D., Eslami, S., Mohamed, S.,
Battaglia, P., Jaderberg, M., & Heess, N.
(2016). Unsupervised learning of 3D structure
from images. Neural Information Processing
Systems, 29, 4997–5005. 344
Riad, R., Teboul, O., Grangier, D., & Zeghidour,
N. (2022). Learning strides in convolutional
neural networks. International Conference on
Learning Representations. 183
Ribeiro, M., Singh, S., & Guestrin, C. (2016).
“Why should I trust you?”: Explaining the
predictions of any classifier. Meeting of the Association
for Computational Linguistics, 97–
101. 425
Ribeiro, M. T., Wu, T., Guestrin, C., & Singh, S.
(2021). Beyond accuracy: Behavioral testing of
NLP models with CheckList. 4824–4828. 234
Richardson, E., Alaluf, Y., Patashnik, O., Nitzan,
Y., Azar, Y., Shapiro, S., & Cohen-Or,
D. (2021). Encoding in style: A Style-
GAN encoder for image-to-image transla-
tion. IEEE/CVF Computer Vision & Pattern
Recognition, 2287–2296. 301
Riedl, M. (2020). AI democratization in the era of
GPT-3. The Gradient, Sept 25, 2020. https:
//thegradient.pub/ai-democratization-in-
the-era-of-gpt-3/. 430
Riedmiller, M. (2005). Neural fitted Q iteration –
first experiences with a data efficient neural
reinforcement learning method. European Conference
on Machine Learning, 317–328. 396
Rippel, O., & Adams, R. P. (2013). High-
dimensional probability estimation with deep
density models. arXiv:1302.5125. 321
Rissanen, J. (1983). A universal prior for inte-
gers and estimation by minimum description
length. The Annals of Statistics, 11(2), 416–
431. 411
Rissanen, S., Heinonen, M., & Solin, A. (2022).
Generative modelling with inverse heat dissi-
pation. arXiv:2206.13397. 369
Rives, A., Meier, J., Sercu, T., Goyal, S., Lin, Z.,
Liu, J., Guo, D., Ott, M., Zitnick, C. L., Ma,
J., et al. (2021). Biological structure and func-
tion emerge from scaling unsupervised learning
to 250 million protein sequences. Proceedings
of the National Academy of Sciences, 118(15).
234
Robbins, H., & Monro, S. (1951). A stochastic
approximation method. The Annals of Math-
ematical Statistics, 22(3), 400–407. 91
Rodrigues, F., & Pereira, F. C. (2020). Beyond
expectation: Deep joint mean and quantile re-
gression for spatiotemporal problems. IEEE
Transactions on Neural Networks and Learn-
ing Systems, 31(12), 5377–5389. 73
Roeder, G., Wu, Y., & Duvenaud, D. K. (2017).
Sticking the landing: Simple, lower-variance
gradient estimators for variational inference.
Neural Information Processing Systems, 30,
6925–6934. 346
Roich, D., Mokady, R., Bermano, A. H., & Cohen-
Or, D. (2022). Pivotal tuning for latent-based
editing of real images. ACM Transactions on
Graphics (TOG), 42(1), 1–13. 300, 301
Rolfe, J. T. (2017). Discrete variational autoen-
coders. International Conference on Learning
Representations. 344
Rolnick, D., Donti, P. L., Kaack, L. H., Kochan-
ski, K., Lacoste, A., Sankaran, K., Ross, A. S.,
Milojevic-Dupont, N., Jaques, N., Waldman-
Brown, A., Luccioni, A. S., Maharaj, T., Sher-
win, E. D., Mukkavilli, S. K., Kording, K. P.,
Gomes, C. P., Ng, A. Y., Hassabis, D., Platt,
J. C., Creutzig, F., Chayes, J. T., & Bengio, Y.
(2023). Tackling climate change with machine
learning. ACM Computing Surveys, 55(2), 1–
42. 420
Rombach, R., Blattmann, A., Lorenz, D., Esser,
P., & Ommer, B. (2022). High-resolution
image synthesis with latent diffusion models.
IEEE/CVF Computer Vision & Pattern
Recognition, 10684–10695. 370
Romero, D. W., Bruintjes, R.-J., Tomczak, J. M.,
Bekkers, E. J., Hoogendoorn, M., & van
Gemert, J. C. (2021). FlexConv: Continuous
kernel convolutions with differentiable kernel
sizes. International Conference on Learning
Representations. 183
Rong, Y., Huang, W., Xu, T., & Huang, J. (2020).
DropEdge: Towards deep graph convolutional
networks on node classification. International
Conference on Learning Representations. 264
Ronneberger, O., Fischer, P., & Brox, T. (2015).
U-Net: Convolutional networks for biomedical
image segmentation. International Conference
on Medical Image Computing and Computer-
Assisted Intervention, 234–241. 184, 198, 205
Rosenblatt, F. (1958). The perceptron: A proba-
bilistic model for information storage and or-
ganization in the brain. Psychological review,
65(6), 386. 37
Rossi, E., Frasca, F., Chamberlain, B., Eynard, D.,
Bronstein, M., & Monti, F. (2020). SIGN: Scal-
able inception graph neural networks. ICML
Graph Representation Learning and Beyond
Workshop, 7 , 15. 263
Roy, A., Saar, M., Vaswani, A., & Grangier, D.
(2021). Ecient content-based sparse atten-
tion with routing transformers. Transactions
of the Association for Computational Linguis-
tics, 9, 53–68. 237
Rozemberczki, B., Kiss, O., & Sarkar, R. (2020).
Little ball of fur: A Python library for graph
sampling. ACM International Conference on
Information & Knowledge Management, 3133–
3140. 264
Rubin, D. B., & Thayer, D. T. (1982). EM algo-
rithms for ML factor analysis. Psychometrika,
47(1), 69–76. 344
Ruder, S. (2016). An overview of gradient descent
optimization algorithms. arXiv:1609.04747 .
91
Rumelhart, D. E., Hinton, G. E., & Williams, R. J.
(1985). Learning internal representations by
error propagation. Technical Report, La Jolla
Institute for Cognitive Science, UCSD. 113,
233, 344
Rumelhart, D. E., Hinton, G. E., & Williams,
R. J. (1986). Learning representations by back-
propagating errors. Nature, 323(6088), 533–
536. 113
Rummery, G. A., & Niranjan, M. (1994). On-line
Q-learning using connectionist systems. Tech-
nical Report, University of Cambridge. 396
Russakovsky, O., Deng, J., Su, H., Krause, J.,
Satheesh, S., Ma, S., Huang, Z., Karpathy,
A., Khosla, A., Bernstein, M., et al. (2015).
ImageNet large scale visual recognition chal-
lenge. International Journal of Computer Vi-
sion, 115(3), 211–252. 175, 181
Russell, S. (2019). Human Compatible: Artificial
Intelligence and the Problem of Control.
Viking. 420
Sabour, S., Frosst, N., & Hinton, G. E. (2017).
Dynamic routing between capsules. Neural In-
formation Processing Systems, 30, 3856–3866.
235
Safran, I., & Shamir, O. (2017). Depth-width
tradeos in approximating natural functions
with neural networks. International Confer-
ence on Machine Learning, 2979–2987. 53
Saha, S., Singh, G., Sapienza, M., Torr, P. H., &
Cuzzolin, F. (2016). Deep learning for detect-
ing multiple space-time action tubes in videos.
British Machine Vision Conference. 182
Saharia, C., Chan, W., Chang, H., Lee, C., Ho,
J., Salimans, T., Fleet, D., & Norouzi, M.
(2022a). Palette: Image-to-image diffusion
models. ACM SIGGRAPH. 8, 369
Saharia, C., Chan, W., Saxena, S., Li, L., Whang,
J., Denton, E., Ghasemipour, S. K. S., Ayan,
B. K., Mahdavi, S. S., Lopes, R. G., et al.
(2022b). Photorealistic text-to-image diffusion
models with deep language understanding.
arXiv:2205.11487. 366, 368, 369, 371
Saharia, C., Ho, J., Chan, W., Salimans, T.,
Fleet, D. J., & Norouzi, M. (2022c). Image
super-resolution via iterative refinement. IEEE
Transactions on Pattern Analysis & Machine
Intelligence, 1–14. 369
Sainath, T. N., Kingsbury, B., Mohamed, A.-
r., Dahl, G. E., Saon, G., Soltau, H., Be-
ran, T., Aravkin, A. Y., & Ramabhadran, B.
(2013). Improvements to deep convolutional
neural networks for LVCSR. IEEE Workshop
on Automatic Speech Recognition and Under-
standing, 315–320. 182
Saito, Y., Takamichi, S., & Saruwatari, H. (2017).
Statistical parametric speech synthesis in-
corporating generative adversarial networks.
IEEE/ACM Transactions on Audio, Speech,
and Language Processing, 26(1), 84–96. 299,
301
Salamon, J., & Bello, J. P. (2017). Deep convolu-
tional neural networks and data augmentation
for environmental sound classification. IEEE
Signal Processing Letters, 24(3), 279–283. 160
Salimans, T., Goodfellow, I., Zaremba, W., Che-
ung, V., Radford, A., & Chen, X. (2016). Im-
proved techniques for training GANs. Neu-
ral Information Processing Systems, 29, 2226–
2234. 274, 299, 300
Salimans, T., & Ho, J. (2022). Progressive distillation
for fast sampling of diffusion models.
International Conference on Learning Repre-
sentations. 370
Salimans, T., Kingma, D., & Welling, M. (2015).
Markov chain Monte Carlo and variational in-
ference: Bridging the gap. International Con-
ference on Machine Learning, 1218–1226. 345
Salimans, T., & Kingma, D. P. (2016). Weight nor-
malization: A simple reparameterization to ac-
celerate training of deep neural networks. Neu-
ral Information Processing Systems, 29, 901–
909. 204
Sanchez-Lengeling, B., Reif, E., Pearce, A., &
Wiltschko, A. B. (2021). A gentle introduc-
tion to graph neural networks. Distill, https:
//distill.pub/2021/gnn-intro/. 261
Sankararaman, K. A., De, S., Xu, Z., Huang,
W. R., & Goldstein, T. (2020). The impact of
neural network overparameterization on gradi-
ent confusion and stochastic gradient descent.
International Conference on Machine Learn-
ing, 8469–8479. 202
Santurkar, S., Tsipras, D., Ilyas, A., & Madry, A.
(2018). How does batch normalization help op-
timization? Neural Information Processing
Systems, 31, 2488–2498. 204
Sauer, A., Schwarz, K., & Geiger, A. (2022).
StyleGAN-XL: Scaling StyleGAN to large di-
verse datasets. ACM SIGGRAPH. 10
Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner,
M., & Monfardini, G. (2008). The graph neural
network model. IEEE Transactions on Neural
Networks, 20(1), 61–80. 262
Schaul, T., Quan, J., Antonoglou, I., & Silver, D.
(2016). Prioritized experience replay. Inter-
national Conference on Learning Representa-
tions. 396
Scherer, D., Müller, A., & Behnke, S. (2010). Eval-
uation of pooling operations in convolutional
architectures for object recognition. International Conference on Artificial Neural Networks, 92–101. 181
Schlag, I., Irie, K., & Schmidhuber, J. (2021). Lin-
ear transformers are secretly fast weight pro-
grammers. International Conference on Ma-
chine Learning, 9355–9366. 235
Schlichtkrull, M., Kipf, T. N., Bloem, P., Berg, R.
v. d., Titov, I., & Welling, M. (2018). Modeling
relational data with graph convolutional net-
works. European Semantic Web Conference,
593–607. 265
Schmidhuber, J. (2022). Annotated history of mod-
ern AI and deep learning. arXiv:2212.11279.
37
Schneider, S., Baevski, A., Collobert, R., &
Auli, M. (2019). wav2vec: Unsupervised
pre-training for speech recognition. INTER-
SPEECH, 3465–3469. 159
Schrittwieser, J., Antonoglou, I., Hubert, T., Si-
monyan, K., Sifre, L., Schmitt, S., Guez, A.,
Lockhart, E., Hassabis, D., Graepel, T., et al.
(2020). Mastering Atari, Go, chess and shogi
by planning with a learned model. Nature,
588(7839), 604–609. 398
Schroecker, Y., Vecerik, M., & Scholz, J. (2019).
Generative predecessor models for sample-efficient imitation learning. International Conference on Learning Representations. 322
Schuhmann, C., Vencu, R., Beaumont, R., Kacz-
marczyk, R., Mullis, C., Katta, A., Coombes,
T., Jitsev, J., & Komatsuzaki, A. (2021).
Laion-400m: Open dataset of clip-ltered 400
million image-text pairs. NeurIPS Workshop
on Data-centric AI. 238
Schulman, J., Levine, S., Abbeel, P., Jordan, M.,
& Moritz, P. (2015). Trust region policy op-
timization. International Conference on Ma-
chine Learning, 1889–1897. 397
Schulman, J., Moritz, P., Levine, S., Jordan, M.,
& Abbeel, P. (2016). High-dimensional contin-
uous control using generalized advantage esti-
mation. International Conference on Learning
Representations. 398
Schulman, J., Wolski, F., Dhariwal, P., Radford,
A., & Klimov, O. (2017). Proximal policy op-
timization algorithms. arXiv:1707.06347. 397
Schuster, M., & Nakajima, K. (2012). Japanese
and Korean voice search. IEEE International
Conference on Acoustics, Speech and Signal
Processing, 5149–5152. 234
Schwarz, J., Jayakumar, S., Pascanu, R., Latham,
P., & Teh, Y. (2021). Powerpropagation: A
sparsity inducing weight reparameterisation.
Neural Information Processing Systems, 34,
28889–28903. 156
Sejnowski, T. J. (2018). The deep learning revolution. MIT Press. 37
Sejnowski, T. J. (2020). The unreasonable eec-
tiveness of deep learning in articial intelli-
gence. Proceedings of the National Academy
of Sciences, 117 (48), 30033–30038. 404
Selsam, D., Lamm, M., Bünz, B., Liang, P.,
de Moura, L., & Dill, D. L. (2019). Learn-
ing a SAT solver from single-bit supervision.
International Conference on Learning Repre-
sentations. 262
Selva, J., Johansen, A. S., Escalera, S., Nas-
rollahi, K., Moeslund, T. B., & Clapés,
A. (2022). Video transformers: A survey.
arXiv:2201.05991. 238
Sennrich, R., Haddow, B., & Birch, A. (2015).
Neural machine translation of rare words with
subword units. Meeting of the Association for
Computational Linguistics. 234
Serra, T., Tjandraatmadja, C., & Ramalingam, S.
(2018). Bounding and counting linear regions
of deep neural networks. International Con-
ference on Machine Learning, 4558–4566. 52
Shang, W., Sohn, K., Almeida, D., & Lee, H.
(2016). Understanding and improving convolutional neural networks via concatenated rectified linear units. International Conference on
Machine Learning, 2217–2225. 38
Sharif Razavian, A., Azizpour, H., Sullivan, J., &
Carlsson, S. (2014). CNN features o-the-shelf:
An astounding baseline for recognition. IEEE
Conference on Computer Vision and Pattern
Recognition Workshop, 806–813. 159
Sharkey, A., & Sharkey, N. (2012). Granny and
the robots: Ethical issues in robot care for the
elderly. Ethics and Information Technology,
14(1), 27–40. 424
Shaw, P., Uszkoreit, J., & Vaswani, A. (2018).
Self-attention with relative position represen-
tations. ACL Human Language Technologies,
464–468. 236
Shen, S., Yao, Z., Gholami, A., Mahoney, M., &
Keutzer, K. (2020a). PowerNorm: Rethink-
ing batch normalization in transformers. In-
ternational Conference on Machine Learning,
8741–8751. 237
Shen, X., Tian, X., Liu, T., Xu, F., & Tao, D.
(2017). Continuous dropout. IEEE Transac-
tions on Neural Networks and Learning Sys-
tems, 29(9), 3926–3937. 158
Shen, Y., Gu, J., Tang, X., & Zhou, B. (2020b). In-
terpreting the latent space of GANs for seman-
tic face editing. IEEE/CVF Computer Vision
& Pattern Recognition, 9243–9252. 300
Shi, W., Caballero, J., Huszár, F., Totz, J., Aitken,
A. P., Bishop, R., Rueckert, D., & Wang,
Z. (2016). Real-time single image and video
super-resolution using an ecient sub-pixel
convolutional neural network. IEEE/CVF
Computer Vision & Pattern Recognition,
1874–1883. 182
Shoeybi, M., Patwary, M., Puri, R., LeGres-
ley, P., Casper, J., & Catanzaro, B. (2019).
Megatron-LM: Training multi-billion parame-
ter language models using model parallelism.
arXiv:1909.08053. 114
Shorten, C., & Khoshgoftaar, T. M. (2019). A
survey on image data augmentation for deep
learning. Journal of Big Data, 6(1), 1–48. 159
Siddique, N., Paheding, S., Elkin, C. P., & Dev-
abhaktuni, V. (2021). U-Net and its variants
for medical image segmentation: A review of
theory and applications. IEEE Access, 82031–
82057. 205
Sifre, L., & Mallat, S. (2013). Rotation, scaling
and deformation invariant scattering for tex-
ture discrimination. IEEE/CVF Computer Vi-
sion & Pattern Recognition, 1233–1240. 183
Silver, D., Huang, A., Maddison, C. J., Guez, A.,
Sifre, L., Van Den Driessche, G., Schrittwieser,
J., Antonoglou, I., Panneershelvam, V., Lanc-
tot, M., et al. (2016). Mastering the game of
Go with deep neural networks and tree search.
Nature, 529(7587), 484–489. 396, 398
Silver, D., Lever, G., Heess, N., Degris, T., Wier-
stra, D., & Riedmiller, M. (2014). Determinis-
tic policy gradient algorithms. International
Conference on Machine Learning, 387–395.
397
Simonovsky, M., & Komodakis, N. (2018). Graph-
VAE: Towards generation of small graphs using
variational autoencoders. International Conference on Artificial Neural Networks, 412–422. 344
Simonyan, K., & Zisserman, A. (2014). Very
deep convolutional networks for large-scale im-
age recognition. International Conference on
Learning Representations. 177, 181
Singh, S. P., & Sutton, R. S. (1996). Reinforce-
ment learning with replacing eligibility traces.
Machine learning, 22(1), 123–158. 396
Sinha, S., Zhao, Z., Goyal, A., Rael, C., & Odena,
A. (2020). Top-k training of GANs: Improving
GAN performance by throwing away bad sam-
ples. Neural Information Processing Systems,
33, 14638–14649. 299
Sisson, M., Spindel, J., Scharre, P., & Kozyulin, V. (2020). The militarization of artificial intelligence. United Nations Office for Disarmament Affairs. 427
Sjöberg, J., & Ljung, L. (1995). Overtraining, reg-
ularization and searching for a minimum, with
application to neural networks. International
Journal of Control, 62(6), 1391–1407. 157
Smith, M., & Miller, S. (2022). The ethical applica-
tion of biometric facial recognition technology.
AI & Society, 37, 167–175. 426
Smith, S., Elsen, E., & De, S. (2020). On the generalization benefit of noise in stochastic gradient
descent. International Conference on Machine
Learning, 9058–9067. 157
Smith, S., Patwary, M., Norick, B., LeGres-
ley, P., Rajbhandari, S., Casper, J., Liu,
Z., Prabhumoye, S., Zerveas, G., Korthikanti,
V., et al. (2022). Using DeepSpeed and
Megatron to train Megatron-Turing NLG
530B, a large-scale generative language model.
arXiv:2201.11990. 234
Smith, S. L., Dherin, B., Barrett, D. G. T., & De,
S. (2021). On the origin of implicit regular-
ization in stochastic gradient descent. Inter-
national Conference on Learning Representa-
tions. 157
Smith, S. L., Kindermans, P., Ying, C., & Le, Q. V.
(2018). Don’t decay the learning rate, increase
the batch size. International Conference on
Learning Representations. 92
Snoek, J., Larochelle, H., & Adams, R. P. (2012).
Practical Bayesian optimization of machine
learning algorithms. Neural Information Processing Systems, 25, 2951–2959. 136
Sohl-Dickstein, J., Weiss, E., Maheswaranathan,
N., & Ganguli, S. (2015). Deep unsuper-
vised learning using nonequilibrium thermody-
namics. International Conference on Machine
Learning, 2256–2265. 274, 367
Sohn, K., Lee, H., & Yan, X. (2015). Learning
structured output representation using deep
conditional generative models. Neural Infor-
mation Processing Systems, 28, 3483–3491.
344
Sohoni, N. S., Aberger, C. R., Leszczynski, M.,
Zhang, J., & Ré, C. (2019). Low-memory
neural network training: A technical report.
arXiv:1904.10631. 114
Sønderby, C. K., Raiko, T., Maaløe, L., Sønderby,
S. K., & Winther, O. (2016a). How to train
deep variational autoencoders and probabilis-
tic ladder networks. arXiv:1602.02282. 344
Sønderby, C. K., Raiko, T., Maaløe, L., Sønderby,
S. K., & Winther, O. (2016b). Ladder varia-
tional autoencoders. Neural Information Processing Systems, 29, 3738–3746. 369
Song, J., Meng, C., & Ermon, S. (2021a). Denoising diffusion implicit models. International
Conference on Learning Representations. 370
Song, Y., & Ermon, S. (2019). Generative mod-
eling by estimating gradients of the data dis-
tribution. Neural Information Processing Sys-
tems, 32, 11895–11907. 367, 371
Song, Y., & Ermon, S. (2020). Improved tech-
niques for training score-based generative mod-
els. Neural Information Processing Systems,
33, 12438–12448. 371
Song, Y., Meng, C., & Ermon, S. (2019). Mint-
Net: Building invertible neural networks with
masked convolutions. Neural Information Pro-
cessing Systems, 32, 11002–11012. 322
Song, Y., Shen, L., Xing, L., & Ermon, S. (2021b).
Solving inverse problems in medical imaging
with score-based generative models. Inter-
national Conference on Learning Representa-
tions. 369
Song, Y., Sohl-Dickstein, J., Kingma, D. P.,
Kumar, A., Ermon, S., & Poole, B.
(2021c). Score-based generative modeling
through stochastic dierential equations. In-
ternational Conference on Learning Represen-
tations. 369, 370, 371
Springenberg, J. T., Dosovitskiy, A., Brox, T., &
Riedmiller, M. (2015). Striving for simplicity:
The all convolutional net. International Con-
ference on Learning Representations. 182
Srivastava, A., Rastogi, A., Rao, A., Shoeb, A.
A. M., Abid, A., Fisch, A., Brown, A. R., San-
toro, A., Gupta, A., Garriga-Alonso, A., et al.
(2022). Beyond the imitation game: Quanti-
fying and extrapolating the capabilities of lan-
guage models. arXiv:2206.04615. 234
Srivastava, A., Valkov, L., Russell, C., Gutmann,
M. U., & Sutton, C. (2017). VEEGAN: Re-
ducing mode collapse in GANs using implicit
variational learning. Neural Information Pro-
cessing Systems, 30, 3308–3318. 300
Srivastava, N., Hinton, G., Krizhevsky, A.,
Sutskever, I., & Salakhutdinov, R. (2014).
Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine
Learning Research, 15(1), 1929–1958. 158
Srivastava, R. K., Gre, K., & Schmidhuber, J.
(2015). Highway networks. arXiv:1505.00387 .
202
Stark, L., & Hoey, J. (2021). The ethics of emotions in artificial intelligence systems. ACM
Conference on Fairness, Accountability, and
Transparency, 782–793. 427
Stark, L., & Hutson, J. (2022). Physiognomic artificial intelligence. Fordham Intellectual Property, Media & Entertainment Law Journal,
XXXII(4), 922–978. 427, 432
Stiennon, N., Ouyang, L., Wu, J., Ziegler, D.,
Lowe, R., Voss, C., Radford, A., Amodei, D., &
Christiano, P. F. (2020). Learning to summa-
rize with human feedback. Neural Information
Processing Systems, 33, 3008–3021. 398
Strubell, E., Ganesh, A., & McCallum, A. (2019).
Energy and policy considerations for deep
learning in NLP. Meeting of the Association
for Computational Linguistics, 3645–3650. 429
Strubell, E., Ganesh, A., & McCallum, A. (2020).
Energy and policy considerations for modern
deep learning research. Meeting of the Asso-
ciation for Computational Linguistics, 13693–
13696. 429
Su, H., Jampani, V., Sun, D., Gallo, O., Learned-
Miller, E., & Kautz, J. (2019a). Pixel-adaptive
convolutional neural networks. IEEE/CVF
Computer Vision & Pattern Recognition,
11166–11175. 183
Su, J., Lu, Y., Pan, S., Wen, B., & Liu, Y. (2021).
RoFormer: Enhanced transformer with rotary
position embedding. arXiv:2104.09864. 236
Su, W., Zhu, X., Cao, Y., Li, B., Lu, L., Wei, F.,
& Dai, J. (2019b). VL-BERT: Pre-training of
generic visual-linguistic representations. Inter-
national Conference on Learning Representa-
tions. 238
Sultan, M. M., Wayment-Steele, H. K., & Pande,
V. S. (2018). Transferable neural networks
for enhanced sampling of protein dynamics.
Journal of Chemical Theory and Computation,
14(4), 1887–1894. 344
Summers, C., & Dinneen, M. J. (2019). Improved
mixed-example data augmentation. Winter
Conference on Applications of Computer Vi-
sion, 1262–1270. 159
Sun, C., Myers, A., Vondrick, C., Murphy, K., &
Schmid, C. (2019). VideoBERT: A joint model
for video and language representation learn-
ing. IEEE/CVF International Conference on
Computer Vision, 7464–7473. 238
Sun, C., Shrivastava, A., Singh, S., & Gupta, A.
(2017). Revisiting unreasonable eectiveness
of data in deep learning era. IEEE/CVF In-
ternational Conference on Computer Vision,
843–852. 238
Sun, R.-Y. (2020). Optimization for deep learn-
ing: An overview. Journal of the Operations
Research Society of China, 8(2), 249–294. 91
Susmelj, I., Agustsson, E., & Timofte, R. (2017).
ABC-GAN: Adaptive blur and control for im-
proved training stability of generative adver-
sarial networks. ICML Workshop on Implicit
Models. 299
Sutskever, I., Martens, J., Dahl, G., & Hinton,
G. (2013). On the importance of initialization
and momentum in deep learning. International
Conference on Machine Learning, 1139–1147.
93
Sutton, R. S. (1984). Temporal credit assignment
in reinforcement learning. Ph.D., University
of Massachusetts Amherst. 396
Sutton, R. S. (1988). Learning to predict by
the methods of temporal dierences. Machine
learning, 3(1), 9–44. 396
Sutton, R. S., & Barto, A. G. (1999). Reinforcement learning: An introduction. MIT Press. 396
Sutton, R. S., & Barto, A. G. (2018). Reinforce-
ment learning: An introduction, 2nd Edition.
MIT Press. 16, 396
Sutton, R. S., McAllester, D., Singh, S., & Man-
sour, Y. (1999). Policy gradient methods for
reinforcement learning with function approxi-
mation. Neural Information Processing Sys-
tems, 12, 1057–1063. 397
Szegedy, C., Ioe, S., Vanhoucke, V., & Alemi,
A. A. (2017). Inception-v4, Inception-Resnet
and the impact of residual connections on
learning. AAAI Conference on Articial In-
telligence, 4278–4284. 181, 183, 405
Szegedy, C., Vanhoucke, V., Ioe, S., Shlens, J.,
& Wojna, Z. (2016). Rethinking the Inception
architecture for computer vision. IEEE/CVF
Computer Vision & Pattern Recognition,
2818–2826. 155, 158, 274
Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J.,
Erhan, D., Goodfellow, I., & Fergus, R. (2014).
Intriguing properties of neural networks. Inter-
national Conference on Learning Representa-
tions. 414
Szeliski, R. (2022). Computer vision: Algorithms
and applications, 2nd Edition. Springer. 15
Tabak, E. G., & Turner, C. V. (2013). A family of
nonparametric density estimation algorithms.
Communications on Pure and Applied Mathe-
matics, 66(2), 145–164. 321
Tabak, E. G., & Vanden-Eijnden, E. (2010). Den-
sity estimation by dual ascent of the log-
likelihood. Communications in Mathematical
Sciences, 8(1), 217–233. 321
Taddeo, M., & Floridi, L. (2018). How AI can be
a force for good. Science, 361(6404), 751–752.
420
Tan, H., & Bansal, M. (2019). LXMERT: Learning
cross-modality encoder representations from
transformers. Empirical Methods in Natural
Language Processing, 5099–5110. 238
Tan, M., & Le, Q. (2019). EcientNet: Rethink-
ing model scaling for convolutional neural net-
works. International Conference on Machine
Learning, 6105–6114. 405
Tay, Y., Bahri, D., Metzler, D., Juan, D.-C., Zhao,
Z., & Zheng, C. (2021). Synthesizer: Rethink-
ing self-attention for transformer models. In-
ternational Conference on Machine Learning,
10183–10192. 235
Tay, Y., Bahri, D., Yang, L., Metzler, D., & Juan,
D.-C. (2020). Sparse Sinkhorn attention. In-
ternational Conference on Machine Learning,
9438–9447. 237
Tay, Y., Dehghani, M., Bahri, D., & Metzler, D.
(2023). Ecient transformers: A survey. ACM
Computing Surveys, 55(6), 109:1–109:28. 237
Tegmark, M. (2018). Life 3.0: Being human in the
age of articial intelligence. Vintage. 14
Telgarsky, M. (2016). Benets of depth in neu-
ral networks. PMLR Conference on Learning
Theory, 1517–1539. 53, 417
Teru, K., Denis, E., & Hamilton, W. (2020). Induc-
tive relation prediction by subgraph reasoning.
International Conference on Machine Learn-
ing, 9448–9457. 265
Tetlock, P. E., & Gardner, D. (2016). Superfore-
casting: The Art and Science of Prediction.
Toronto: Signal, McClelland & Stewart. 435
Tewari, A., Elgharib, M., Bharaj, G., Bernard,
F., Seidel, H.-P., Pérez, P., Zollhofer, M.,
& Theobalt, C. (2020). StyleRig: Rigging
StyleGAN for 3D control over portrait im-
ages. IEEE/CVF Computer Vision & Pattern
Recognition, 6142–6151. 300
Teye, M., Azizpour, H., & Smith, K. (2018).
Bayesian uncertainty estimation for batch nor-
malized deep networks. International Confer-
ence on Machine Learning, 4907–4916. 204
Theis, L., Oord, A. v. d., & Bethge, M. (2016).
A note on the evaluation of generative models.
International Conference on Learning Repre-
sentations. 322
Thompson, W. R. (1933). On the likelihood
that one unknown probability exceeds an-
other in view of the evidence of two samples.
Biometrika, 25(3-4), 285–294. 396
Thompson, W. R. (1935). On the theory of appor-
tionment. American Journal of Mathematics,
57(2), 450–456. 396
Thoppilan, R., De Freitas, D., Hall, J., Shazeer,
N., Kulshreshtha, A., Cheng, H.-T., Jin, A.,
Bos, T., Baker, L., Du, Y., et al. (2022).
LaMDA: Language models for dialog applica-
tions. arXiv:2201.08239. 234
Tipping, M. E., & Bishop, C. M. (1999). Prob-
abilistic principal component analysis. Jour-
nal of the Royal Statistical Society: Series B,
61(3), 611–622. 344
Tolmeijer, S., Kneer, M., Sarasua, C., Christen,
M., & Bernstein, A. (2020). Implementations
in machine ethics: A survey. ACM Computing
Surveys, 53(6), 1–38. 424
Tolstikhin, I., Bousquet, O., Gelly, S., &
Schoelkopf, B. (2018). Wasserstein auto-
encoders. International Conference on Learn-
ing Representations. 345
Tomašev, N., Cornebise, J., Hutter, F., Mohamed,
S., Picciariello, A., Connelly, B., Belgrave,
D. C., Ezer, D., Haert, F. C. v. d., Mugisha,
F., et al. (2020). AI for social good: Unlocking
the opportunity for positive impact. Nature
Communications, 11(1), 2468. 420
Tomasev, N., McKee, K. R., Kay, J., & Mohamed,
S. (2021). Fairness for unobserved character-
istics: Insights from technological impacts on
queer communities. AAAI/ACM Conference
on AI, Ethics, and Society, 254–265. 424
Tomczak, J. M., & Welling, M. (2016). Improving variational auto-encoders using Householder flow. NIPS Workshop on Bayesian Deep
Learning. 322
Tompson, J., Goroshin, R., Jain, A., LeCun, Y.,
& Bregler, C. (2015). Ecient object localiza-
tion using convolutional networks. IEEE/CVF
Computer Vision & Pattern Recognition, 648–
656. 183
Torralba, A., Freeman, W., & Isola, P. (2024).
Foundations of Computer Vision. MIT Press.
15
Touati, A., Satija, H., Romo, J., Pineau, J., &
Vincent, P. (2020). Randomized value functions via multiplicative normalizing flows. Uncertainty in Artificial Intelligence, 422–432.
322
Touvron, H., Cord, M., Douze, M., Massa, F.,
Sablayrolles, A., & Jégou, H. (2021). Training
data-ecient image transformers & distillation
through attention. International Conference
on Machine Learning, 10347–10357. 238
Tran, D., Bourdev, L., Fergus, R., Torresani, L.,
& Paluri, M. (2015). Learning spatiotemporal
features with 3D convolutional networks. IEEE
International Conference on Computer Vision,
4489–4497. 182
Tran, D., Vafa, K., Agrawal, K., Dinh, L., & Poole,
B. (2019). Discrete ows: Invertible generative
models of discrete data. Neural Information
Processing Systems, 32, 14692–14701. 322, 324
Tran, D., Wang, H., Torresani, L., Ray, J., Le-
Cun, Y., & Paluri, M. (2018). A closer look at
spatiotemporal convolutions for action recogni-
tion. IEEE/CVF Computer Vision & Pattern
Recognition, 6450–6459. 181
Tsitsulin, A., Palowitch, J., Perozzi, B., & Müller,
E. (2020). Graph clustering with graph neural
networks. arXiv:2006.16904. 262
Tzen, B., & Raginsky, M. (2019). Neural
stochastic dierential equations: Deep la-
tent Gaussian models in the diusion limit.
arXiv:1905.09883. 324
Ulku, I., & Akagündüz, E. (2022). A survey on
deep learning-based architectures for semantic
segmentation on 2D images. Applied Articial
Intelligence, 36(1). 184
Ulyanov, D., Vedaldi, A., & Lempitsky, V. (2016).
Instance normalization: The missing ingredi-
ent for fast stylization. arXiv:1607.08022. 203
Ulyanov, D., Vedaldi, A., & Lempitsky, V. (2018).
Deep image prior. IEEE/CVF Computer Vi-
sion & Pattern Recognition, 9446–9454. 418
Urban, G., Geras, K. J., Kahou, S. E., Aslan, O.,
Wang, S., Caruana, R., Mohamed, A., Phili-
pose, M., & Richardson, M. (2017). Do deep
convolutional nets really need to be deep and
convolutional? International Conference on
Learning Representations. 417, 418
Vahdat, A., Andriyash, E., & Macready, W.
(2018a). DVAE#: Discrete variational autoen-
coders with relaxed Boltzmann priors. Neu-
ral Information Processing Systems, 31, 1869–
1878. 344
Vahdat, A., Andriyash, E., & Macready, W.
(2020). Undirected graphical models as ap-
proximate posteriors. International Confer-
ence on Machine Learning, 9680–9689. 344
Vahdat, A., & Kautz, J. (2020). NVAE: A deep
hierarchical variational autoencoder. Neural
Information Processing Systems, 33, 19667–
19679. 340, 345, 369
Vahdat, A., Kreis, K., & Kautz, J. (2021). Score-
based generative modeling in latent space.
Neural Information Processing Systems, 34,
11287–11302. 370
Vahdat, A., Macready, W., Bian, Z., Khoshaman,
A., & Andriyash, E. (2018b). DVAE++: Dis-
crete variational autoencoders with overlap-
ping transformations. International Confer-
ence on Machine Learning, 5035–5044. 344
Vallor, S. (2011). Carebots and caregivers: Sus-
taining the ethical ideal of care in the 21st cen-
tury. Philosophy and Technology, 24(3), 251–
268. 429
Vallor, S. (2015). Moral deskilling and upskilling in a new machine age: Reflections on the ambiguous future of character. Philosophy & Technology, 28, 107–124. 429
Van den Oord, A., Dieleman, S., Zen, H., Si-
monyan, K., Vinyals, O., Graves, A., Kalch-
brenner, N., Senior, A., & Kavukcuoglu, K.
(2016a). WaveNet: A generative model for raw
audio. ISCA Speech Synthesis Workshop. 323
Van den Oord, A., Kalchbrenner, N., Espeholt, L.,
Vinyals, O., Graves, A., et al. (2016b). Con-
ditional image generation with PixelCNN de-
coders. Neural Information Processing Sys-
tems, 29, 4790–4798. 274
Van den Oord, A., Kalchbrenner, N., &
Kavukcuoglu, K. (2016c). Pixel recurrent neu-
ral networks. International Conference on Ma-
chine Learning, 1747–1756. 233, 345
Van den Oord, A., Li, Y., Babuschkin, I., Si-
monyan, K., Vinyals, O., Kavukcuoglu, K.,
Driessche, G., Lockhart, E., Cobo, L., Stim-
berg, F., et al. (2018). Parallel WaveNet: Fast
high-delity speech synthesis. International
Conference on Machine Learning, 3918–3926.
323
Van den Oord, A., Vinyals, O., et al. (2017). Neural discrete representation learning. Neural Information Processing Systems, 30, 6306–6315.
344, 345
Van Hasselt, H. (2010). Double Q-learning. Neu-
ral Information Processing Systems, 23, 2613–
2621. 397
Van Hasselt, H., Guez, A., & Silver, D. (2016).
Deep reinforcement learning with double Q-
learning. AAAI Conference on Articial In-
telligence, 2094–2100. 397
Van Hoof, H., Chen, N., Karl, M., van der Smagt,
P., & Peters, J. (2016). Stable reinforcement
learning with autoencoders for tactile and vi-
sual data. IEEE/RSJ International Confer-
ence on Intelligent Robots and Systems, 3928–
3934. IEEE. 344
van Wynsberghe, A., & Robbins, S. (2019). Critiquing the reasons for making artificial moral
agents. Science and Engineering Ethics, 25,
719–735. 424
Vapnik, V. (1995). The nature of statistical learn-
ing theory. New York: Springer Verlag. 74
Vapnik, V. N., & Chervonenkis, A. Y. (1971). On
the uniform convergence of relative frequencies
of events to their probabilities. Measures of
Complexity, 11–30. 134
Vardi, G., Yehudai, G., & Shamir, O. (2022).
Width is less important than depth in ReLU
neural networks. PMLR Conference on Learning Theory, 1–33. 53
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit,
J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polo-
sukhin, I. (2017). Attention is all you need.
Neural Information Processing Systems, 30,
5998–6008. 158, 233, 234, 235, 236, 237
Veit, A., Wilber, M. J., & Belongie, S. (2016).
Residual networks behave like ensembles of rel-
atively shallow networks. Neural Information
Processing Systems, 29, 550–558. 202, 417
Veličković, P. (2023). Everything is connected:
Graph neural networks. Current Opinion in
Structural Biology, 79, 102538. 261
Veličković, P., Cucurull, G., Casanova, A.,
Romero, A., Lio, P., & Bengio, Y. (2019).
Graph attention networks. International Con-
ference on Learning Representations. 234, 263,
265
Véliz, C. (2020). Privacy is Power: Why and How
You Should Take Back Control of Your Data.
Bantam Press. 435
Véliz, C. (2023). Chatbots shouldn’t use emojis.
Nature, 615, 375. 428
Vijayakumar, A. K., Cogswell, M., Selvaraju,
R. R., Sun, Q., Lee, S., Crandall, D., & Ba-
tra, D. (2016). Diverse beam search: Decoding
diverse solutions from neural sequence models.
arXiv:1610.02424. 235
Vincent, J. (2020). What a machine learning tool
that turns Obama white can (and can’t) tell us
about AI bias / a striking image that only hints
at a much bigger problem. The Verge, June 23,
2020. https://www.theverge.com/21298762/face-depixelizer-ai-machine-learning-tool-pulse-stylegan-obama-bias. 422
Vincent, P., Larochelle, H., Bengio, Y., & Man-
zagol, P.-A. (2008). Extracting and composing
robust features with denoising autoencoders.
International Conference on Machine Learn-
ing, 1096–1103. 344
Voita, E., Talbot, D., Moiseev, F., Sennrich, R.,
& Titov, I. (2019). Analyzing multi-head
self-attention: Specialized heads do the heavy
lifting, the rest can be pruned. Meeting of
the Association for Computational Linguistics,
5797–5808. 235
Voleti, V., Jolicoeur-Martineau, A., & Pal, C.
(2022). MCVD: Masked conditional video dif-
fusion for prediction, generation, and interpo-
lation. Neural Information Processing Sys-
tems, 35. 369
Vondrick, C., Pirsiavash, H., & Torralba, A.
(2016). Generating videos with scene dynam-
ics. Neural Information Processing Systems,
29, 613–621. 299
Wachter, S., Mittelstadt, B., & Floridi, L. (2017).
Why a right to explanation of automated
decision-making does not exist in the general
data protection regulation. International Data
Privacy Law, 7(2), 76–99. 425
Waibel, A., Hanazawa, T., Hinton, G., Shikano,
K., & Lang, K. J. (1989). Phoneme recogni-
tion using time-delay neural networks. IEEE
Transactions on Acoustics, Speech, and Signal
Processing, 37(3), 328–339. 181
Wallach, W., Allen, C., & Smit, I. (2008). Ma-
chine morality: Bottom-up and top-down ap-
proaches for modeling human moral faculties.
AI & Society, 22(4), 565–582. 424
Wan, L., Zeiler, M., Zhang, S., LeCun, Y., & Fergus, R. (2013). Regularization of neural networks using DropConnect. International Conference on Machine Learning, 1058–1066. 158
Wan, Z., Zhang, J., Chen, D., & Liao, J. (2021).
High-delity pluralistic image completion with
transformers. IEEE/CVF International Con-
ference on Computer Vision, 4692–4701. 238
Wang, A., Pruksachatkun, Y., Nangia, N., Singh,
A., Michael, J., Hill, F., Levy, O., & Bow-
man, S. (2019a). SuperGLUE: A stickier
benchmark for general-purpose language un-
derstanding systems. Neural Information Pro-
cessing Systems, 32, 3261–3275. 234
Wang, A., Singh, A., Michael, J., Hill, F., Levy,
O., & Bowman, S. R. (2019b). GLUE: A multi-
task benchmark and analysis platform for nat-
ural language understanding. International
Conference on Learning Representations. 234
Wang, B., Shang, L., Lioma, C., Jiang, X., Yang,
H., Liu, Q., & Simonsen, J. G. (2020a). On
position embeddings in BERT. International
Conference on Learning Representations. 236
Wang, C.-Y., Bochkovskiy, A., & Liao, H.-Y. M.
(2022a). YOLOv7: Trainable bag-of-freebies sets
new state-of-the-art for real-time object detec-
tors. arXiv:2207.02696. 184
Wang, P. Z., & Wang, W. Y. (2019). Riemannian normalizing flow on variational Wasserstein autoencoder for text modeling. ACL Human Language Technologies, 284–294. 324
Wang, S., Li, B. Z., Khabsa, M., Fang, H., & Ma,
H. (2020b). Linformer: Self-attention with lin-
ear complexity. arXiv:2006.04768. 237
Wang, T., Liu, M., Zhu, J., Yakovenko, N., Tao, A., Kautz, J., & Catanzaro, B. (2018a). Video-to-video synthesis. Neural Information Processing Systems, 31, 1152–1164. 299
Wang, T.-C., Liu, M.-Y., Zhu, J.-Y., Tao,
A., Kautz, J., & Catanzaro, B. (2018b).
High-resolution image synthesis and seman-
tic manipulation with conditional GANs.
IEEE/CVF Computer Vision & Pattern
Recognition, 8798–8807. 300, 301
Wang, W., Xie, E., Li, X., Fan, D.-P., Song,
K., Liang, D., Lu, T., Luo, P., & Shao, L.
(2021). Pyramid vision transformer: A ver-
satile backbone for dense prediction without
convolutions. IEEE/CVF International Con-
ference on Computer Vision, 568–578. 238
Wang, W., Yao, L., Chen, L., Lin, B., Cai, D.,
He, X., & Liu, W. (2022b). Crossformer: A
versatile vision transformer hinging on cross-
scale attention. International Conference on
Learning Representations. 238
Wang, X., Girshick, R., Gupta, A., & He,
K. (2018c). Non-local neural networks.
IEEE/CVF Computer Vision & Pattern
Recognition, 7794–7803. 238
Wang, X., Wang, S., Liang, X., Zhao, D., Huang,
J., Xu, X., Dai, B., & Miao, Q. (2022c).
Deep reinforcement learning: A survey. IEEE
Transactions on Neural Networks and Learn-
ing Systems. 396
Wang, Y., & Kosinski, M. (2018). Deep neural
networks are more accurate than humans at
detecting sexual orientation from facial images.
Journal of Personality and Social Psychology,
114(2), 246–257. 430
Wang, Y., Mohamed, A., Le, D., Liu, C., Xiao,
A., Mahadeokar, J., Huang, H., Tjandra,
A., Zhang, X., Zhang, F., et al. (2020c).
Transformer-based acoustic modeling for hy-
brid speech recognition. IEEE International
Conference on Acoustics, Speech and Signal
Processing, 6874–6878. 234
Wang, Z., Bapst, V., Heess, N., Mnih, V., Munos,
R., Kavukcuoglu, K., & de Freitas, N. (2017).
Sample ecient actor-critic with experience re-
play. International Conference on Learning
Representations. 398
Wang, Z., Schaul, T., Hessel, M., van Hasselt,
H., Lanctot, M., & Freitas, N. (2016). Du-
eling network architectures for deep reinforce-
ment learning. International Conference on
Machine Learning, 1995–2003. 397
Ward, P. N., Smofsky, A., & Bose, A. J. (2019).
Improving exploration in soft-actor-critic with
normalizing ows policies. ICML Workshop on
Invertible Neural Networks and Normalizing
Flows. 322
Watkins, C. J., & Dayan, P. (1992). Q-learning.
Machine learning, 8(3-4), 279–292. 396
Watkins, C. J. C. H. (1989). Learning from delayed
rewards. Ph.D., University of Cambridge. 396
Wehenkel, A., & Louppe, G. (2019). Unconstrained
monotonic neural networks. Neural Informa-
tion Processing Systems, 32, 1543–1553. 323
Wei, J., Ren, X., Li, X., Huang, W., Liao, Y.,
Wang, Y., Lin, J., Jiang, X., Chen, X., & Liu,
Q. (2019). NEZHA: Neural contextualized rep-
resentation for Chinese language understand-
ing. arXiv:1909.00204. 236
Wei, J., & Zou, K. (2019). EDA: Easy data
augmentation techniques for boosting perfor-
mance on text classification tasks. ACL Em-
pirical Methods in Natural Language Process-
ing, 6382–6388. 160
Weidinger, L., Uesato, J., Rauh, M., Griffin, C.,
Huang, P.-S., Mellor, J., Glaese, A., Cheng,
M., Balle, B., Kasirzadeh, A., Biles, C., Brown,
S., Kenton, Z., Hawkins, W., Stepleton, T.,
Birhane, A., Hendricks, L. A., Rimell, L.,
Isaac, W., Haas, J., Legassick, S., Irving, G., &
Gabriel, I. (2022). Taxonomy of risks posed by
language models. ACM Conference on Fair-
ness, Accountability, and Transparency, 214–
229. 428
Weisfeiler, B., & Leman, A. (1968). The reduction
of a graph to canonical form and the algebra
which appears therein. NTI, Series, 2(9), 12–
16. 264
Welling, M., & Teh, Y. W. (2011). Bayesian learn-
ing via stochastic gradient Langevin dynamics.
International Conference on Machine Learn-
ing, 681–688. 159
Wen, Y.-H., Yang, Z., Fu, H., Gao, L., Sun,
Y., & Liu, Y.-J. (2021). Autoregressive styl-
ized motion synthesis with generative flow.
IEEE/CVF Computer Vision & Pattern
Recognition, 13612–13621. 322
Wenzel, F., Roth, K., Veeling, B. S., Świątkowski,
J., Tran, L., Mandt, S., Snoek, J., Salimans,
T., Jenatton, R., & Nowozin, S. (2020a). How
good is the Bayes posterior in deep neural net-
works really? International Conference on
Machine Learning, 10248–10259. 159
Wenzel, F., Snoek, J., Tran, D., & Jenatton, R.
(2020b). Hyperparameter ensembles for ro-
bustness and uncertainty quantication. Neu-
ral Information Processing Systems, 33, 6514–
6527. 158
Werbos, P. (1974). Beyond regression: New tools
for prediction and analysis in the behavioral
sciences. Ph.D. dissertation, Harvard Univer-
sity. 113
White, T. (2016). Sampling generative networks.
arXiv:1609.04468. 342, 344
Whitney, H. (1932). Congruent graphs and the
connectivity of graphs. Hassler Whitney Col-
lected Papers, 61–79. 264
Wightman, R., Touvron, H., & Jégou, H. (2021).
ResNet strikes back: An improved training
procedure in timm. Neural Information Pro-
cessing Systems Workshop. 202
Williams, C. K., & Rasmussen, C. E. (2006). Gaus-
sian processes for machine learning. MIT
Press. 15
Williams, P. M. (1996). Using neural networks to
model conditional multivariate densities. Neu-
ral Computation, 8(4), 843–854. 73
Williams, R. J. (1992). Simple statistical gradient-
following algorithms for connectionist rein-
forcement learning. Machine learning, 8(3),
229–256. 397
Wilson, A. C., Roelofs, R., Stern, M., Srebro, N.,
& Recht, B. (2017). The marginal value of
adaptive gradient methods in machine learn-
ing. Neural Information Processing Systems,
30, 4148–4158. 94, 410
Wirnsberger, P., Ballard, A. J., Papamakarios, G.,
Abercrombie, S., Racanière, S., Pritzel, A.,
Jimenez Rezende, D., & Blundell, C. (2020).
Targeted free energy estimation via learned
mappings. The Journal of Chemical Physics,
153(14), 144112. 322
Wolf, S. (2021). ProGAN: How NVIDIA
generated images of unprecedented qual-
ity. https : / / towardsdatascience . com /
progan- how- nvidia- generated- images- of-
unprecedented-quality-51c98ec2cbd2. 286
Wolf, V., Lugmayr, A., Danelljan, M., Van Gool,
L., & Timofte, R. (2021). DeFlow: Learn-
ing complex image degradations from unpaired
data with conditional flows. IEEE/CVF Com-
puter Vision & Pattern Recognition, 94–103.
322
Wolfe, C. R., Yang, J., Chowdhury, A., Dun,
C., Bayer, A., Segarra, S., & Kyrillidis, A.
(2021). GIST: Distributed training for large-
scale graph convolutional networks. NeurIPS
Workshop on New Frontiers in Graph Learn-
ing. 264
Wolpert, D. H. (1992). Stacked generalization.
Neural Networks, 5(2), 241–259. 158
Wong, K. W., Contardo, G., & Ho, S. (2020).
Gravitational-wave population inference with
deep ow-based generative network. Physical
Review D, 101(12), 123005. 322
Worrall, D. E., Garbin, S. J., Turmukhambetov,
D., & Brostow, G. J. (2017). Harmonic net-
works: Deep translation and rotation equivari-
ance. IEEE/CVF Computer Vision & Pattern
Recognition, 5028–5037. 183
Wu, B., Xu, C., Dai, X., Wan, A., Zhang, P., Yan,
Z., Tomizuka, M., Gonzalez, J., Keutzer, K., &
Vajda, P. (2020a). Visual transformers: Token-
based image representation and processing for
computer vision. arXiv:2006.03677. 238
Wu, F., Fan, A., Baevski, A., Dauphin, Y. N.,
& Auli, M. (2019). Pay less attention with
lightweight and dynamic convolutions. Inter-
national Conference on Learning Representa-
tions. 235
Wu, H., & Gu, X. (2015). Max-pooling dropout
for regularization of convolutional neural net-
works. Neural Information Processing Sys-
tems, vol. 18, 46–54. 183
Wu, J., Huang, Z., Thoma, J., Acharya, D., &
Van Gool, L. (2018a). Wasserstein divergence
for GANs. European Conference on Computer
Vision, 653–668. 299
Wu, J., Zhang, C., Xue, T., Freeman, B., & Tenen-
baum, J. (2016). Learning a probabilistic la-
tent space of object shapes via 3D generative-
adversarial modeling. Neural Information Pro-
cessing Systems, 29, 82–90. 299
Wu, N., Green, B., Ben, X., & O’Banion, S.
(2020b). Deep transformer models for time se-
ries forecasting: The influenza prevalence case.
arXiv:2001.08317. 234
Wu, R., Yan, S., Shan, Y., Dang, Q., & Sun, G.
(2015a). Deep image: Scaling up image recog-
nition. arXiv:1501.02876, 7 (8). 154
Wu, S., Sun, F., Zhang, W., Xie, X., & Cui,
B. (2023). Graph neural networks in recom-
mender systems: A survey. ACM Computing
Surveys, 55(5), 97:1–97:37. 262
Wu, X., & Zhang, X. (2016). Automated
inference on criminality using face images.
arXiv:1611.04135. 427
Wu, Y., Burda, Y., Salakhutdinov, R., & Grosse,
R. (2017). On the quantitative analysis of
decoder-based generative models. Interna-
tional Conference on Learning Representa-
tions. 300
Wu, Y., & He, K. (2018). Group normalization.
European Conference on Computer Vision, 3–
19. 203, 204
Wu, Z., Lischinski, D., & Shechtman, E. (2021).
Stylespace analysis: Disentangled controls
for StyleGAN image generation. IEEE/CVF
Computer Vision & Pattern Recognition,
12863–12872. 300
Wu, Z., Nagarajan, T., Kumar, A., Rennie, S.,
Davis, L. S., Grauman, K., & Feris, R. (2018b).
BlockDrop: Dynamic inference paths in resid-
ual networks. IEEE/CVF Computer Vision &
Pattern Recognition, 8817–8826. 203
Wu, Z., Pan, S., Chen, F., Long, G., Zhang, C.,
& Philip, S. Y. (2020c). A comprehensive sur-
vey on graph neural networks. IEEE Transac-
tions on Neural Networks and Learning Sys-
tems, 32(1), 4–24. 261
Wu, Z., Song, S., Khosla, A., Yu, F., Zhang,
L., Tang, X., & Xiao, J. (2015b). 3D
ShapeNets: A deep representation for volumet-
ric shapes. IEEE/CVF Computer Vision &
Pattern Recognition, 1912–1920. 182
Xia, F., Liu, T.-Y., Wang, J., Zhang, W., & Li, H.
(2008). Listwise approach to learning to rank:
theory and algorithm. International Confer-
ence on Machine Learning, 1192–1199. 73
Xia, W., Zhang, Y., Yang, Y., Xue, J.-H., Zhou,
B., & Yang, M.-H. (2022). GAN inversion: A
survey. IEEE Transactions on Pattern Analy-
sis and Machine Intelligence, 1–17. 301
Xiao, L., Bahri, Y., Sohl-Dickstein, J., Schoenholz,
S., & Pennington, J. (2018a). Dynamical isom-
etry and a mean field theory of CNNs: How to
train 10,000-layer vanilla convolutional neural
networks. International Conference on Ma-
chine Learning, 5393–5402. 114, 183
Xiao, S., Wang, S., Dai, Y., & Guo, W. (2022a).
Graph neural networks in node classification:
Survey and evaluation. Machine Vision and
Applications, 33(1), 1–19. 262
Xiao, T., Hong, J., & Ma, J. (2018b). DNA-
GAN: Learning disentangled representations
from multi-attribute images. International
Conference on Learning Representations. 301
Xiao, Z., Kreis, K., & Vahdat, A. (2022b). Tackling
the generative learning trilemma with denois-
ing diusion GANs. International Conference
on Learning Representations. 370
Xie, E., Wang, W., Yu, Z., Anandkumar, A., Al-
varez, J. M., & Luo, P. (2021). SegFormer:
Simple and ecient design for semantic seg-
mentation with transformers. Neural Informa-
tion Processing Systems, 34, 12077–12090. 238
Xie, L., Wang, J., Wei, Z., Wang, M., & Tian, Q.
(2016). DisturbLabel: Regularizing CNN on
the loss layer. IEEE/CVF Computer Vision &
Pattern Recognition, 4753–4762. 158
Xie, S., Girshick, R., Dollár, P., Tu, Z., & He, K.
(2017). Aggregated residual transformations
for deep neural networks. IEEE/CVF Com-
puter Vision & Pattern Recognition, 1492–
1500. 181, 202, 405
Xie, Y., & Li, Q. (2022). Measurement-conditioned
denoising diusion probabilistic model for
under-sampled medical image reconstruction.
Medical Image Computing and Computer As-
sisted Intervention, vol. 13436, 655–664. 369
Xing, E. P., Ho, Q., Dai, W., Kim, J. K., Wei, J.,
Lee, S., Zheng, X., Xie, P., Kumar, A., & Yu,
Y. (2015). Petuum: A new platform for dis-
tributed machine learning on big data. IEEE
Transactions on Big Data, 1(2), 49–67. 114
Xing, Y., Qian, Z., & Chen, Q. (2021). Invert-
ible image signal processing. IEEE/CVF Com-
puter Vision & Pattern Recognition, 6287–
6296. 322
Xiong, R., Yang, Y., He, D., Zheng, K., Zheng,
S., Xing, C., Zhang, H., Lan, Y., Wang, L., &
Liu, T. (2020a). On layer normalization in the
transformer architecture. International Con-
ference on Machine Learning, 10524–10533.
237
Xiong, Z., Yuan, Y., Guo, N., & Wang, Q. (2020b).
Variational context-deformable convnets for
indoor scene parsing. IEEE/CVF Computer
Vision & Pattern Recognition, 3992–4002. 183
Xu, B., Wang, N., Chen, T., & Li, M. (2015).
Empirical evaluation of rectified activations
in convolutional network. arXiv:1505.00853.
158, 160
Xu, K., Hu, W., Leskovec, J., & Jegelka, S. (2019).
How powerful are graph neural networks? In-
ternational Conference on Learning Represen-
tations. 264
Xu, K., Li, C., Tian, Y., Sonobe, T.,
Kawarabayashi, K.-i., & Jegelka, S. (2018).
Representation learning on graphs with jump-
ing knowledge networks. International Con-
ference on Machine Learning, 5453–5462. 263,
265, 266
Xu, K., Zhang, M., Jegelka, S., & Kawaguchi, K.
(2021a). Optimization of graph neural net-
works: Implicit acceleration by skip connec-
tions and more depth. International Confer-
ence on Machine Learning, 11592–11602. 266
Xu, P., Cheung, J. C. K., & Cao, Y. (2020).
On variational learning of controllable repre-
sentations for text without supervision. In-
ternational Conference on Machine Learning,
10534–10543. 343, 345
Xu, P., Kumar, D., Yang, W., Zi, W., Tang, K.,
Huang, C., Cheung, J. C. K., Prince, S. J. D.,
& Cao, Y. (2021b). Optimizing deeper trans-
formers on small datasets. Meeting of the As-
sociation for Computational Linguistics. 114,
234, 238
Yamada, Y., Iwamura, M., Akiba, T., & Kise,
K. (2019). Shakedrop regularization for deep
residual learning. IEEE Access, 7, 186126–
186136. 202, 203
Yamada, Y., Iwamura, M., & Kise, K. (2016).
Deep pyramidal residual networks with sepa-
rated stochastic depth. arXiv:1612.01230. 202
Yan, X., Yang, J., Sohn, K., & Lee, H. (2016). At-
tribute2Image: Conditional image generation
from visual attributes. European Conference
on Computer Vision, 776–791. 301
Yang, F., Yang, H., Fu, J., Lu, H., & Guo, B.
(2020a). Learning texture transformer network
for image super-resolution. IEEE/CVF Com-
puter Vision & Pattern Recognition, 5791–
5800. 238
Yang, G., Pennington, J., Rao, V., Sohl-Dickstein,
J., & Schoenholz, S. S. (2019). A mean field
theory of batch normalization. International
Conference on Learning Representations. 203
Yang, K., Goldman, S., Jin, W., Lu, A. X.,
Barzilay, R., Jaakkola, T., & Uhler, C.
(2021). Mol2Image: Improved conditional
flow models for molecule to image synthe-
sis. IEEE/CVF Computer Vision & Pattern
Recognition, 6688–6698. 322
Yang, Q., Zhang, Y., Dai, W., & Pan, S. J.
(2020b). Transfer learning. Cambridge Uni-
versity Press. 159
Yang, R., Srivastava, P., & Mandt, S. (2022). Dif-
fusion probabilistic modeling for video genera-
tion. arXiv:2203.09481. 369, 371
Yao, W., Zeng, Z., Lian, C., & Tang, H. (2018).
Pixel-wise regression using U-Net and its ap-
plication on pansharpening. Neurocomputing,
312, 364–371. 205
Ye, H., & Young, S. (2004). High quality voice
morphing. IEEE International Conference on
Acoustics, Speech, and Signal Processing, 1–9.
160
Ye, L., Rochan, M., Liu, Z., & Wang, Y. (2019).
Cross-modal self-attention network for refer-
ring image segmentation. IEEE/CVF Com-
puter Vision & Pattern Recognition, 10502–
10511. 238
Ye, W., Liu, S., Kurutach, T., Abbeel, P., & Gao,
Y. (2021). Mastering Atari games with limited
data. Neural Information Processing Systems,
34, 25476–25488. 396
Ying, R., He, R., Chen, K., Eksombatchai, P.,
Hamilton, W. L., & Leskovec, J. (2018a).
Graph convolutional neural networks for web-
scale recommender systems. ACM SIGKDD
International Conference on Knowledge Dis-
covery & Data Mining, 974–983. 264, 265
Ying, Z., You, J., Morris, C., Ren, X., Hamil-
ton, W., & Leskovec, J. (2018b). Hierarchi-
cal graph representation learning with differen-
tiable pooling. Neural Information Processing
Systems, 31, 4805–4815. 265
Yoshida, Y., & Miyato, T. (2017). Spectral norm
regularization for improving the generalizabil-
ity of deep learning. arXiv:1705.10941. 156
You, Y., Chen, T., Wang, Z., & Shen, Y. (2020).
When does self-supervision help graph convo-
lutional networks? International Conference
on Machine Learning, 10871–10880. 159
Yu, F., & Koltun, V. (2015). Multi-scale con-
text aggregation by dilated convolutions. In-
ternational Conference on Learning Represen-
tations. 181
Yu, J., Lin, Z., Yang, J., Shen, X., Lu, X., &
Huang, T. S. (2019). Free-form image inpaint-
ing with gated convolution. IEEE/CVF In-
ternational Conference on Computer Vision,
4471–4480. 181
Yu, J., Zheng, Y., Wang, X., Li, W., Wu, Y., Zhao,
R., & Wu, L. (2021). FastFlow: Unsupervised
anomaly detection and localization via 2D nor-
malizing ows. arXiv:2111.07677 . 322
Yu, J. J., Derpanis, K. G., & Brubaker, M. A.
(2020). Wavelet ow: Fast training of high res-
olution normalizing ows. Neural Information
Processing Systems, 33, 6184–6196. 322
Yu, L., Zhang, W., Wang, J., & Yu, Y. (2017).
SeqGAN: Sequence generative adversarial nets
with policy gradient. AAAI Conference on Ar-
tificial Intelligence, 2852–2858. 299
Yun, S., Han, D., Oh, S. J., Chun, S., Choe, J.,
& Yoo, Y. (2019). CutMix: Regularization
strategy to train strong classifiers with localiz-
able features. IEEE/CVF International Con-
ference on Computer Vision, 6023–6032. 160
Zagoruyko, S., & Komodakis, N. (2016). Wide
residual networks. British Machine Vision
Conference. 202, 417
Zagoruyko, S., & Komodakis, N. (2017). Paying
more attention to attention: Improving the
performance of convolutional neural networks
via attention transfer. International Confer-
ence on Learning Representations. 415
Zaheer, M., Kottur, S., Ravanbakhsh, S., Poc-
zos, B., Salakhutdinov, R. R., & Smola, A. J.
(2017). Deep sets. Neural Information Pro-
cessing Systems, 30, 3391–3401. 263
Zaheer, M., Reddi, S., Sachan, D., Kale, S., & Ku-
mar, S. (2018). Adaptive methods for noncon-
vex optimization. Neural Information Process-
ing Systems, 31, 9815–9825. 93
Zaslavsky, T. (1975). Facing up to arrangements:
Face-count formulas for partitions of space by
hyperplanes: Face-count formulas for parti-
tions of space by hyperplanes. Memoirs of the
American Mathematical Society. 38, 40
Zeiler, M. D. (2012). ADADELTA: An adaptive
learning rate method. arXiv:1212.5701. 93
Zeiler, M. D., & Fergus, R. (2014). Visualizing and
understanding convolutional networks. Euro-
pean Conference on Computer Vision, 818–
833. 181, 184
Zeiler, M. D., Taylor, G. W., & Fergus, R. (2011).
Adaptive deconvolutional networks for mid
and high level feature learning. IEEE Interna-
tional Conference on Computer Vision, 2018–
2025. 181
Zeng, H., Zhou, H., Srivastava, A., Kannan, R.,
& Prasanna, V. (2020). GraphSAINT: Graph
sampling based inductive learning method. In-
ternational Conference on Learning Represen-
tations. 264
Zeng, Y., Fu, J., Chao, H., & Guo, B. (2019).
Learning pyramid-context encoder network for
high-quality image inpainting. IEEE/CVF
Computer Vision & Pattern Recognition,
1486–1494. 205
Zhai, S., Talbott, W., Srivastava, N., Huang, C.,
Goh, H., Zhang, R., & Susskind, J. (2021). An
attention free transformer. 235
Zhang, A., Lipton, Z. C., Li, M., & Smola, A. J.
(2023). Dive into deep learning. Cambridge
University Press. 15
Zhang, C., Bengio, S., Hardt, M., Recht, B.,
& Vinyals, O. (2017a). Understanding deep
learning requires rethinking generalization. In-
ternational Conference on Learning Represen-
tations. 156, 403, 418
Zhang, C., Ouyang, X., & Patras, P. (2017b).
ZipNet-GAN: Inferring fine-grained mobile
traffic patterns via a generative adversarial
neural network. International Conference on
emerging Networking EXperiments and Tech-
nologies, 363–375. 299
Zhang, H., Cisse, M., Dauphin, Y. N., & Lopez-
Paz, D. (2017c). mixup: Beyond empirical
risk minimization. International Conference
on Learning Representations. 160
Zhang, H., Dauphin, Y. N., & Ma, T. (2019a).
Fixup initialization: Residual learning with-
out normalization. International Conference
on Learning Representations. 114, 205
Zhang, H., Goodfellow, I., Metaxas, D., & Odena,
A. (2019b). Self-attention generative adversar-
ial networks. International Conference on Ma-
chine Learning, 7354–7363. 299
Zhang, H., Hsieh, C.-J., & Akella, V. (2016a).
Hogwild++: A new mechanism for decentral-
ized asynchronous stochastic gradient descent.
IEEE International Conference on Data Min-
ing, 629–638. 114
Zhang, H., Xu, T., Li, H., Zhang, S., Wang,
X., Huang, X., & Metaxas, D. N. (2017d).
StackGAN: Text to photo-realistic image syn-
thesis with stacked generative adversarial net-
works. IEEE/CVF International Conference
on Computer Vision, 5907–5915. 300, 301
Zhang, J., & Meng, L. (2019). GResNet: Graph
residual network for reviving deep GNNs from
suspended animation. arXiv:1909.05729. 263
Zhang, J., Shi, X., Xie, J., Ma, H., King, I., &
Yeung, D.-Y. (2018a). GaAN: Gated attention
networks for learning on large and spatiotem-
poral graphs. Uncertainty in Artificial Intelli-
gence, 339–349. 263
Zhang, J., Zhang, H., Xia, C., & Sun, L.
(2020). Graph-Bert: Only attention is
needed for learning graph representations.
arXiv:2001.05140. 263
Zhang, K., Yang, Z., & Başar, T. (2021a).
Multi-agent reinforcement learning: A selec-
tive overview of theories and algorithms. Hand-
book of Reinforcement Learning and Control,
321–384. 398
Zhang, L., & Agrawala, M. (2023). Adding condi-
tional control to text-to-image diffusion mod-
els. arXiv:2302.05543. 370
Zhang, M., & Chen, Y. (2018). Link prediction
based on graph neural networks. Neural In-
formation Processing Systems, 31, 5171–5181.
262
Zhang, M., Cui, Z., Neumann, M., & Chen, Y.
(2018b). An end-to-end deep learning architec-
ture for graph classification. AAAI Conference
on Artificial Intelligence, 4438–4445. 262, 265
Zhang, Q., & Chen, Y. (2021). Diffusion normaliz-
ing flow. Neural Information Processing Sys-
tems, 34, 16280–16291. 371
Zhang, R. (2019). Making convolutional networks
shift-invariant again. International Conference
on Machine Learning, 7324–7334. 182, 183
Zhang, R., Isola, P., & Efros, A. A. (2016b). Col-
orful image colorization. European Conference
on Computer Vision, 649–666. 159
Zhang, S., Tong, H., Xu, J., & Maciejewski, R.
(2019c). Graph convolutional networks: A
comprehensive review. Computational Social
Networks, 6(1), 1–23. 262
Zhang, S., Zhang, C., Kang, N., & Li, Z.
(2021b). iVPF: Numerical invertible volume
preserving ow for ecient lossless compres-
sion. IEEE/CVF Computer Vision & Pattern
Recognition, 620–629. 322
Zhang, X., Zhao, J., & LeCun, Y. (2015).
Character-level convolutional networks for text
classication. Neural Information Processing
Systems, 28, 649–657. 182
Zhao, H., Jia, J., & Koltun, V. (2020a). Ex-
ploring self-attention for image recognition.
IEEE/CVF Computer Vision & Pattern
Recognition, 10076–10085. 238
Zhao, J., Mathieu, M., & LeCun, Y. (2017a).
Energy-based generative adversarial network.
International Conference on Learning Repre-
sentations. 299
Zhao, L., & Akoglu, L. (2020). PairNorm: Tackling
oversmoothing in GNNs. International Con-
ference on Learning Representations. 265
Zhao, L., Mo, Q., Lin, S., Wang, Z., Zuo, Z., Chen,
H., Xing, W., & Lu, D. (2020b). UCTGAN: Di-
verse image inpainting based on unsupervised
cross-space translation. IEEE/CVF Computer
Vision & Pattern Recognition, 5741–5750. 238
Zhao, S., Song, J., & Ermon, S. (2017b). InfoVAE:
Balancing learning and inference in variational
autoencoders. AAAI Conference on Artificial
Intelligence, 5885–5892. 345
Zhao, S., Song, J., & Ermon, S. (2017c). To-
wards deeper understanding of variational au-
toencoding models. arXiv:1702.08658. 345
Zheng, C., Cham, T.-J., & Cai, J. (2021). TFill:
Image completion via a transformer-based ar-
chitecture. arXiv:2104.00845. 238
Zheng, G., Yang, Y., & Carbonell, J.
(2017). Convolutional normalizing ows.
arXiv:1711.02255. 322
Zheng, Q., Zhang, A., & Grover, A. (2022). Online
decision transformer. International Confer-
ence on Machine Learning, 162, 27042–27059.
398
Zhong, Z., Zheng, L., Kang, G., Li, S., & Yang,
Y. (2020). Random erasing data augmenta-
tion. AAAI Conference on Artificial Intelli-
gence, 13001–13008. 159
Zhou, C., Ma, X., Wang, D., & Neubig, G. (2019).
Density matching for bilingual word embed-
ding. ACL Human Language Technologies,
1588–1598. 322
Zhou, H., Alvarez, J. M., & Porikli, F. (2016a).
Less is more: Towards compact CNNs. Eu-
ropean Conference on Computer Vision, 662–
677. 414
Zhou, J., Cui, G., Hu, S., Zhang, Z., Yang, C.,
Liu, Z., Wang, L., Li, C., & Sun, M. (2020a).
Graph neural networks: A review of methods
and applications. AI Open, 1, 57–81. 261
Zhou, K., Huang, X., Li, Y., Zha, D., Chen, R., &
Hu, X. (2020b). Towards deeper graph neural
networks with differentiable group normaliza-
tion. Neural Information Processing Systems,
33, 4917–4928. 265
Zhou, L., Du, Y., & Wu, J. (2021). 3D shape gener-
ation and completion through point-voxel dif-
fusion. IEEE/CVF International Conference
on Computer Vision, 5826–5835. 369
Zhou, T., Krahenbuhl, P., Aubry, M., Huang,
Q., & Efros, A. A. (2016b). Learning dense
correspondence via 3D-guided cycle consis-
tency. IEEE/CVF Computer Vision & Pat-
tern Recognition, 117–126. 301
Zhou, Y.-T., & Chellappa, R. (1988). Computation
of optical ow using a neural network. IEEE
International Conference on Neural Networks,
71–78. 181
Zhou, Z., & Li, X. (2017). Graph convolu-
tion: A high-order and adaptive approach.
arXiv:1706.09916. 263
Zhou, Z., Rahman Siddiquee, M. M., Tajbakhsh,
N., & Liang, J. (2018). UNet++: A nested U-
Net architecture for medical image segmenta-
tion. Deep Learning in Medical Image Analysis
Workshop, 3–11. 205
Zhu, C., Ni, R., Xu, Z., Kong, K., Huang, W. R.,
& Goldstein, T. (2021). GradInit: Learning
to initialize neural networks for stable and effi-
cient training. Neural Information Processing
Systems, 34, 16410–16422. 113
Zhu, J., Krähenbühl, P., Shechtman, E., & Efros,
A. A. (2016). Generative visual manipulation
on the natural image manifold. European Con-
ference on Computer Vision, 597–613. 301
Zhu, J., Shen, Y., Zhao, D., & Zhou, B. (2020a).
In-domain GAN inversion for real image edit-
ing. European Conference on Computer Vi-
sion, 592–608. 301
Zhu, J.-Y., Park, T., Isola, P., & Efros, A. A.
(2017). Unpaired image-to-image transla-
tion using cycle-consistent adversarial net-
works. IEEE/CVF International Conference
on Computer Vision, 2223–2232. 296, 301
Zhu, X., Su, W., Lu, L., Li, B., Wang, X., & Dai,
J. (2020b). Deformable DETR: Deformable
transformers for end-to-end object detection.
International Conference on Learning Repre-
sentations. 238
Zhuang, F., Qi, Z., Duan, K., Xi, D., Zhu, Y., Zhu,
H., Xiong, H., & He, Q. (2020). A comprehen-
sive survey on transfer learning. Proceedings
of the IEEE, 109(1), 43–76. 159
Ziegler, Z., & Rush, A. (2019). Latent normaliz-
ing ows for discrete sequences. International
Conference on Machine Learning, 7673–7682.
322, 323
Zong, B., Song, Q., Min, M. R., Cheng, W.,
Lumezanu, C., Cho, D., & Chen, H. (2018).
Deep autoencoding Gaussian mixture model
for unsupervised anomaly detection. Inter-
national Conference on Learning Representa-
tions. 344
Zou, D., Cao, Y., Zhou, D., & Gu, Q. (2020).
Gradient descent optimizes over-parameterized
deep ReLU networks. Machine Learning, 109,
467–492. 404
Zou, D., Hu, Z., Wang, Y., Jiang, S., Sun, Y.,
& Gu, Q. (2019). Layer-dependent importance
sampling for training deep and large graph con-
volutional networks. Neural Information Pro-
cessing Systems, 32, 11247–11256. 264
Zou, H., & Hastie, T. (2005). Regularization and
variable selection via the elastic net. Journal of
the Royal Statistical Society: Series B, 67(2),
301–320. 156
Zou, Z., Chen, K., Shi, Z., Guo, Y., & Ye, J. (2023).
Object detection in 20 years: A survey. Pro-
ceedings of the IEEE. 184
Index
ℓ2 norm, 442
ℓ∞ norm, 442
ℓp norm, 442
<cls> token, 221
1×1 convolution, 174, 181
1D convolution, 163
1D convolutional network, 162–170, 182
2D convolutional network, 170–174
3D U-Net, 205
3D convolutional network, 182
ACGAN, 288
action value, 377
activation, 35
activation function, 25, 38
concatenated ReLU, 38
ELU, 38
GeLU, 38
HardSwish, 38
leaky ReLU, 38
logistic sigmoid, 38
parametric ReLU, 38
ReLU, 25, 38
scaled exponential linear unit, 113
SiLU, 38
Softplus, 38
Swish, 38
tanh, 38
activation normalization, 113
activation pattern, 27
ActNorm, 113
actor-critic method, 393
AdaDelta, 93
AdaGrad, 93
Adam, 88, 93
rectied, 93
AdamW, 94, 155
adaptive kernels, 183
adaptive moment estimation, 93
adaptive training methods, 93
adjacency matrix, 243–245
adjoint graph, 260
advantage estimate, 391
advantage function, 393
adversarial attack, 413
adversarial loss, 292, 301
adversarial training, 149
ane function, 446
aggregated posterior, 340, 341
AlexNet, 174
algorithmic dierentiation, 106
AMSGrad, 93
ancestral sampling, 459
argmax function, 437
argmin function, 437
articial moral agency, 424
ethical impact agent, 424
explicit ethical agent, 424
full ethical agent, 424
implicit ethical agent, 424
asynchronous data parallelism, 114
ATARI 2600 benchmark, 386
atrous convolution, 181
attention
additive, 235
as routing, 235
graph attention network, 258
key-value, 235
local, 237
memory-compressed, 235, 237
multiplicative, 235
squeeze-and-excitation network, 235
synthesizer, 235
augmentation, 152–154, 159
in graph neural networks, 264
autocorrelation function, 441
autoencoder, 344
variational, 326–347
automatic translation, 226
automation bias, 429
automation of jobs, 13
autoregressive ow, 311–313, 323
auxiliary classier GAN, 288
average pooling, 171, 181
asymptotic notation, 438
backpropagation, 97–106, 113
in branching graphs, 107
on acyclic graph, 116
bagging, 146
Banach xed point theorem, 314
baseline, 391
batch, 85
batch normalization, 192–194, 203
alternatives to, 205
costs and benets, 194
ghost, 203
Monte Carlo, 203
why it helps, 204
batch reinforcement learning, 394
batch renormalization, 203
Bayes’ rule, 450
Bayesian neural networks, 150
Bayesian optimization, 135
beam search, 224
behavior policy, 384
Bellman equations, 379
Bernoulli distribution, 65
BERT, 219–222
beta VAE, 342, 345
beta-Bernoulli bandit, 135
bias (component of test error), 122
bias and fairness, 13, 421–424
fairness through unawareness, 423
mitigation, 423
protected attribute, 423
separation, 423
bias parameter, 36
bias vector, 49
bias-variance trade-off, 125
big-O notation, 438
BigBird, 237
bijection, 439
binary classication, 64–66
binary cross-entropy loss, 66
binomial coecient, 441
BlockDrop, 202
BOHB, 136
Boltzmann policy, 399
bootstrap aggregating, 146
BPE dropout, 234
byte pair encoding, 218, 234
capacity, 29, 46, 125, 134
eective, 134
Rademacher complexity, 134
Vapnik-Chervonenkis dimension, 134
capsule network, 235
cascaded diusion model, 367, 369
categorical distribution, 67
channel, 165
channel-separate convolution, 181
chatbot, 398
ChatGPT, 398
InstructGPT, 398
classical regime, 129
classication
binary, 2, 64–66
ImageNet, 174–176, 181
multiclass, 2, 67–69
text, 221
classier guidance, 364, 370
classier-free guidance, 365
CLIP, 238
cls token, 221
Cluster GCN, 264
CNN, see convolutional network
colorization, 291
column space, 443
computer vision
image classication, 174
object detection, 177
semantic segmentation, 178
concave function, 440
concentration of power, 430
concept shift, 135
conditional GAN, 288, 300
conditional generation, 7, 288, 290, 370
conditional probability, 449
conditional VAE, 344
continuous distribution, 448
contraction mapping, 314
control network, 370
convex function, 80, 91
convex region, 440
convolution
1×1, 174, 181
1D, 163
adaptive, 183
atrous, 181
channel, 165
depthwise, 181
dilated, 164, 181
feature map, 165
fractionally strided, 278
gated, 181
grouped, 181
guided, 183
kernel, 163
padding, 164
partial, 181
separable, 181
stride, 164
transposed, 172, 181
valid, 164
convolutional layer, 161, 165
convolutional network, 161–185
1D, 162–170, 182
2D, 170–174
3D, 182
AlexNet, 174
changing number of channels, 174
downsampling, 171
early applications, 180
Geodesic CNN, 265
GoogLeNet, 181
inductive bias, 170
LeNet, 180
network-in-network, 181
upsampling, 172
VGG, 176
visualizing, 184
ConvolutionOrthogonal initializer, 183
cost function, see loss function
coupling ow, 310–311, 323
coupling function, 322
covariance, 454
covariance matrix, 454, 456
diagonal, 456
full, 456
spherical, 456
covariant function, 162
covariate shift, 135
cross-attention, 227
cross-covariance image transformers, 238
cross-entropy, 71–72
cross-validation, 134
Crossformer, 238
curse of dimensionality, 129, 135
cutout, 158, 183
CycleGAN, 292–295
DALL·E-2, 11, 370
data
structured, 17
tabular, 17
training set, 118
data augmentation, 152–154, 159
data drift, 135
concept shift, 135
covariate shift, 135
prior shift, 135
data parallelism
asynchronous, 114
synchronous, 114
data privacy, 428
dataset
ImageNet, 174
MNIST, 291
MNIST-1D, 118
DaViT, 232, 238
DCGAN, 278
DDIM, 370
deadly triad issue, 396
decision transformer, 394
decoder
convolutional network, 179
diusion model, 348
transformer, 222–227
VAE, 337
decoding algorithm, 234
deep convolutional GAN, 278
deep dueling network, 397
deep neural network, 41–55
matrix notation, 49
necessity of, 417–418
number of linear regions, 50, 52
vs. shallow, 49–51
deep Q-network, 385–387, 396
DeepSets, 263
degree matrix, 257
denoising diusion implicit model, 364, 370
DenseNet, 195, 205
depth eciency, 50, 53
depth of neural network, 46
depthwise convolution, 181
design justice, 433
determinant, 444
diagonal covariance matrix, 456
diagonal enhancement, 257
diagonal matrix, 445
dieomorphism, 439
dierentiation
forward mode, 117
reverse mode, 116
DiPool, 265
diusion model, 367, 348–372
applications, 369
cascaded, 369
classier guidance, 364, 370
classier-free guidance, 365
computing likelihood, 369
conditional distribution, 353–355
control network, 370
DALL·E-2, 370
DDIM, 370
decoder, 348, 355–356
denoising diusion implicit model, 364
encoder, 348–355
evidence lower bound, 356–358
forward process, 349–355
generating images, 362
GLIDE, 370
image generation, 367
Imagen, 370
implementation, 362–367
improving quality, 369
improving speed, 363, 369
kernel, 350–352
loss function, 358–359
marginal distribution, 352–353
noise conditioning augmentation, 369
noise schedule, 349
probability ow ODE, 370
relation to other models, 371
reparameterization, 360–362
reverse process, 355–356
text-to-image, 370
training, 356–360
update, 349
vs. other generative models, 369
dilated convolution, 164, 181
dilation rate, 165
Dirac delta function, 440
discount factor, 374
discrete distribution, 448
discriminative model, 23
discriminator, 275
disentangled latent space, 270
disentanglement, 341, 345
distance between distributions, 459–461
distance between normal distributions, 461
distillation, 415, 418
distributed training, 114
data parallelism
asynchronous, 114
synchronous, 114
pipeline model parallelism, 114
tensor model parallelism, 114
distribution
Bernoulli, 65
categorical, 67
mixture of Gaussians, 75, 327
multivariate normal, 456
normal, 61
Poisson, 76
univariate normal, 456
von Mises, 74
divergence
Jensen-Shannon, 460
Kullback-Leibler, 71, 460
diversity, 433
dot product, 443
dot-product self-attention, 208–215
key, 210
matrix representation, 212
query, 210
scaled, 214
value, 208
double descent, 127, 134, 412
epoch-wise, 134
double DQN, 387
double Q-learning, 387
downsample, 171
DropConnect, 265
DropEdge, 264
dropout, 147
cutout, 158, 183
in max pooling layer, 183
Monte Carlo, 158
recurrent, 158
spatial, 158, 183
dual attention vision transformer, 232, 238
dual-primal graph CNN, 264
dying ReLU problem, 38
dynamic programming method, 382
early stopping, 145, 157
edge, 240
directed, 241
embedding, 243
undirected, 241
edge graph, 260
eective model capacity, 134
eigenspectrum, 444
eigenvalue, 444
elastic net penalty, 156
ELBO, see evidence lower bound
elementwise ow, 309–310, 322
ELIZA eect, 428
ELU, 38
EM algorithm, 346
embedding, 218
employment, 429
encoder, 179
convolutional network, 179
diusion model, 348
transformer, 219–222
VAE, 337
encoder-decoder network, 179
encoder-decoder self-attention, 227
encoder-decoder transformer, 226
ensemble, 145, 157, 158
fast geometric, 158
snapshot, 158
stochastic weight averaging, 158
entropy SGD, 158
environmental impact, 429
episode, 383
epoch, 85
epsilon-greedy policy, 384
equivariance, 162
group, 182
permutation, 239
rotation, 182
translation, 182
ethical impact agent, 424
ethics, 12–14, 420–435
articial moral agency, 424
ethical impact agent, 424
explicit ethical agent, 424
full ethical agent, 424
implicit ethical agent, 424
automation bias, 429
automation of jobs, 13
bias and fairness, 13, 421–424
fairness through unawareness, 423
mitigation, 423
protected attribute, 423
separation, 423
case study, 430
concentration of power, 430
diversity, 433
employment, 429
environmental impact, 429
existential risk, 14
explainability, 13, 425–426
LIME, 426
intellectual property, 428
misuse
militarization, 13
misuse of AI, 426
data privacy, 428
face recognition, 426
fraud, 427
militarization, 427
political interference, 427
moral deskilling, 429
scientic communication, 432
transparency, 424–425
functional, 425
run, 425
structural, 425
value alignment, 420–426
inner alignment problem, 421
outer alignment problem, 421
principal agent problem, 421
value-free ideal of science, 431
Euclidean norm, 442
evidence, 451
evidence lower bound, 330–335
properties, 333
reformulation, 333
tightness of bound, 333
existential risk, 14
expectation, 452–455
rules for manipulating, 452
expectation maximization, 346
experience replay, 386
explainability, 13, 425–426
LIME, 426
explicit ethical agent, 424
exploding gradient problem, 108
in residual networks, 192
exploration-exploitation trade-off, 373
exponential function, 440
extended transformer construction, 237
face recognition, 426
factor analysis, 344
factor VAE, 346
fairness, see bias
fairness through unawareness, 423
fast geometric ensembles, 158
feature map, 165
feature pyramid network, 183
feed-forward network, 35
few-shot learning, 224, 225
filter, 163
fine-tuning, 151, 152
fitted Q-learning, 384
fitting, see training
Fixup, 205
flatness of minimum, 411
flooding, 159
focal loss, 73
forward-mode differentiation, 117
Fréchet distance, 461
between normals, 461
Fréchet inception distance, 272
fractionally strided convolution, 278
fraud, 427
Frobenius norm, 442
regularization, 140, 155
full covariance matrix, 456
full ethical agent, 424
full-batch gradient descent, 85
fully connected, 36
function, 437, 439
bijection, 439
dieomorphism, 439
exponential, 440
gamma, 440
injection, 439
logarithm, 440
surjection, 439
functional transparency, 425
Gabor model, 80
gamma function, 440
GAN, see generative adversarial network
gated convolution, 181
gated multi-layer perceptron, 235
Gaussian distribution, see normal distribution
GeLU, 38
generalization, 118, 402
factors that determine, 410–414
generative adversarial network, 275–302
ACGAN, 288
adversarial loss, 292
conditional, 288, 300
conditional generation, 288–290
CycleGAN, 292–295
DCGAN, 278
diculty of training, 279
discriminator, 275
editing images with, 301
generator, 275
image translation, 290–295
InfoGAN, 290
inverting, 301
least squares, 299
loss function, 276, 299
mini-batch discrimination, 288, 300
mode collapse, 279, 300
mode dropping, 279
multiple scales, 300
PatchGAN, 291
Pix2Pix, 291
progressive growing, 286, 300
SRGAN, 292
StyleGAN, 295–297, 300
tricks for training, 299
truncation trick, 288
VEEGAN, 300
Wasserstein, 280–285, 299
gradient penalty, 285
weight clipping, 285
generative model, 7, 23, 223, 269
desirable properties, 269
quantifying performance, 271
generator, 275
geodesic CNN, 265
geometric graph
example, 241
geodesic CNN, 265
MoNet, 265
ghost batch normalization, 203
GLIDE, 370
global minimum, 81
Glorot initialization, 113
GLOW, 318–320, 323
Goldilocks zone, 410, 412
GoogLeNet, 181
GPT3, 222–227
decoding, 223
few-shot learning, 224
GPU, 107
gradient checkpointing, 114
gradient descent, 77–78, 91
GradInit, 113
graph
adjacency matrix, 243–245
adjoint, 260
edge, 240, 260
directed, 241
embedding, 243
undirected, 241
examples, 240
expansion problem, 254
geometric, 241
heterogeneous, 241
hierarchical, 241
knowledge, 241
line, 260
max pooling aggregation, 258
neighborhood sampling, 254
node, 240
embedding, 243, 244
partitioning, 254
real world, 240
tasks, 246
types, 241
graph attention network, 258, 263
graph isomorphism network, 264
graph Laplacian, 262
graph neural network, 240–267
augmentation, 264
batches, 264
dual-primal graph CNN, 264
graph attention network, 258
GraphSAGE, 262
higher-order convolutional layer, 263
MixHop, 263
MoNet, 262
normalization, 265
over-smoothing, 265
pooling, 265
regularization, 264
residual connection, 263, 266
spectral methods, 262
suspended animation, 265–266
graphics processing unit, 107
GraphNorm, 265
GraphSAGE, 262
GraphSAINT, 264
GResNet, 263
grokking, 412
group normalization, 203
grouped convolution, 181
guided convolution, 183
HardSwish, 38
He initialization, 110, 113
Heaviside function, 104
heteroscedastic regression, 64, 73
hidden layer, 35
hidden unit, 27, 35
hidden variable, see latent variable
hierarchical graph, 241
highway network, 202
Hogwild!, 114
homoscedastic regression, 64
hourglass network, 179, 197, 205
stacked, 198
Hutchinson trace estimator, 316
hyperband, 136
hypernetwork, 235
hyperparameter, 46
model, 46
training algorithm, 91
hyperparameter search, 132, 133, 135
Bayesian optimization, 135
beta-Bernoulli bandit, 135
BOHB, 136
hyperband, 136
random sampling, 135
SMAC, 136
Tree-Parzen estimators, 136
i.i.d., see independent and identically distributed
identity matrix, 445
image interpolation, 11
image translation, 290–295, 301
ImageGPT, 229
Imagen, 370
ImageNet classication, 174–176, 181
implicit ethical agent, 424
implicit regularization, 144, 156–157
importance sampling, 339
inception block, 181
inception score, 271
independence, 451
independent and identically distributed, 58
inductive bias, 129
convolutional, 170
relational, 248
inductive model, 252
inference, 17
innitesimal ows, 324
InfoGAN, 290
information preference problem, 345
initialization, 107–111
ActNorm, 113
convolutional layers, 183
ConvolutionOrthogonal, 183
Fixup, 205
Glorot, 113
GradInit, 113
He, 113
layer-sequential unit variance, 113
LeCun, 113
SkipInit, 205
TFixup, 237
Xavier, 113
injection, 439
inner alignment problem, 421
inpainting, 8
instance normalization, 203
InstructGPT, 398
intellectual property, 428
internal covariate shift, 203
interpretability, see explainability
intersectionality, 423
invariance, 161
permutation, 162, 213, 249
rotation, 182
scale, 182
translation, 182
inverse autoregressive ow, 313
inverse of a matrix, 443
invertible layer, 308
autoregressive ow, 311–313, 323
coupling ow, 310–311
elementwise ow, 309–310
linear ow, 308–309
residual ow, 313–316, 323
iResNet, 314–316, 323
iRevNet, 313–314, 323
Jacobian, 447
Janossy pooling, 263
Jensen’s inequality, 330
Jensen-Shannon divergence, 460
joint probability, 448
k-fold cross-validation, 134
k-hop neighborhood, 254
kernel, 163
size, 163
key, 210
Kipf normalization, 258, 262
KL divergence, see Kullback-Leibler divergence
knowledge distillation, 415, 418
knowledge graph, 241
Kullback-Leibler divergence, 71, 460
between normals, 461
Lk pooling, 181
L-infinity norm, 442
L0 regularization, 155
L1 regularization, 156
L2 norm, 442
L2 regularization, 140
label, 64
label smoothing, 149, 158
language model, 222, 234
few-shot learning, 224
GPT3, 222–227
large language model, 224, 234
LASSO, 155, 156
latent space
disentangled, 270
latent variable, 7, 268
latent variable model, 326
mixture of Gaussians, 327
nonlinear, 327
layer, 35
convolutional, 161, 165
hidden, 35
input, 35
invertible, 308
autoregressive ow, 311–313
coupling ow, 310–311
elementwise ow, 309–310
linear ow, 308–309
residual ow, 313–316, 323
output, 35
residual, 189
layer normalization, 203
layer-sequential unit variance initialization, 113
layer-wise DropEdge, 264
leaky ReLU, 38
learning, 18
learning rate, 78
schedule, 86
warmup, 93
learning to rank, 73
least squares GAN, 299
least squares loss, 19, 62
LeCun initialization, 113
LeNet, 180
likelihood, 58, 450, 451
likelihood ratio identity, 389
LIME, 426
line graph, 260
line search, 92
linear algebra, 446
linear ow, 308–309, 322
linear function, 27, 446
linear programming, 284
linear regression, 18
LinFormer, 237
Lipschitz constant, 439
local attention, 237
local minimum, 81
in real loss functions, 408
log-likelihood, 59
logarithm, 440
logistic regression, 94
logistic sigmoid, 66
loss, 19–21
adversarial, 292
perceptual, 292
VGG, 292
loss function, 21, 56–76
binary cross-entropy, 66
convex, 80
cross-entropy, 71–72
focal, 73
global minimum, 81
least squares, 19, 62
local minimum, 81
multiclass cross-entropy, 69
negative log-likelihood, 60
non-convex, 80
pinball, 73
properties of, 406–410
quantile, 73
ranking, 73
recipe for computing, 60
saddle point, 81
vs. cost function, 23
vs. objective function, 23
lottery ticket, 406, 415
lower triangular matrix, 445
Lp norm, 442
manifold, 273
manifold precision/recall, 273
marginalization, 449
Markov chain, 350
Markov decision process, 377
Markov process, 373
Markov reward process, 373
masked autoregressive ow, 312
masked self-attention, 223
matrix, 436, 442
calculus, 447
column space, 443
determinant, 444
eigenvalue, 444
inverse, 443
Jacobian, 447
permutation, 245
product, 443
singular, 443
special types, 445
diagonal, 445
identity, 445
lower triangular, 445
orthogonal, 446
permutation, 446
upper triangular, 445
trace, 444
transpose, 442
max function, 437
max pooling, 171, 181
max pooling aggregation, 258
max unpooling, 172, 181
MaxBlurPool, 182
maximum a posteriori criterion, 139
maximum likelihood, 56–59
mean, 454
mean pooling, 171, 246
measuring performance, 118–137
median estimation, 73
memory-compressed attention, 237
micro-batching, 114
militarization, 427
min function, 437
mini-batch, 85
discrimination, 288, 300
minimax game, 277
minimum, 81
connections between, 407
family of, 407
global, 81
local, 81
route to, 407
misuse, 426
data privacy, 428
face recognition, 426
fraud, 427
militarization, 427
political interference, 427
MixHop, 263
mixture density network, 74
mixture model network, 262, 265
mixture of Gaussians, 75, 327
MLP, 35
MNIST, 291
MNIST-1D, 118
mode collapse, 279, 300
mode dropping, 279
model, 17
capacity, 134
eective, 134
representational, 134
inductive, 252
machine learning, 4
parameter, 18
testing, 22
transductive, 252
modern regime, 129
momentum, 86, 92
Nesterov, 86, 92
MoNet, 262, 265
Monte Carlo batch normalization, 203
Monte Carlo dropout, 158
Monte Carlo method, 381, 383
moral deskilling, 429
multi-head self-attention, 214
multi-layer perceptron, 35
multi-scale ow, 316–317
multi-scale vision transformer, 230, 238
multi-task learning, 151
multiclass classication, 67–69
multiclass cross-entropy loss, 69
multigraph, 241
multivariate normal, 456
multivariate regression, 69
MViT, 238
NAdam, 92
named entity recognition, 221
Nash equilibrium, 277
natural language processing, 207, 216
automatic translation, 226
benchmarks, 234
BERT, 219–222
embedding, 218
GPT3, 222–227
named entity recognition, 221
question answering, 222
sentiment analysis, 221
tasks, 232
text classication, 221
tokenization, 218
natural policy gradients, 397
negative log-likelihood, 60
neighborhood sampling, 254
Nesterov accelerated momentum, 86, 92
network, see neural network
network dissection, 184
network inversion, 184
network-in-network, 181
neural architecture search, 132
neural network
shallow, 25–40
bias, 36
capacity, 29, 46
capsule, 235
composing, 41
convolutional, 161–185
deep, 41–55
deep vs. shallow, 49–51
depth, 46
depth eciency, 50, 53
encoder-decoder, 179
feed-forward, 35
fully connected, 36
graph, 240–267
highway, 202
history, 37
hourglass, 179, 197
hyperparameter, 46
layer, 35
matrix notation, 49
recurrent, 233
residual, 186, 206
stacked hourglass, 198
transformer, 207–239
U-Net, 197
weights, 36
width, 46
width eciency, 53
neural ODE, 202
neuron, see hidden unit
Newton method, 92
NLP, see natural language processing
node, 240
embedding, 243, 244
noise, 122
adding to inputs, 149
adding to weights, 149
noise conditioning augmentation, 369
noise schedule, 349
noisy deep Q-network, 397
non-convex function, 80
non-negative homogeneity, 39
nonlinear function, 27
nonlinear latent variable model, 327
norm
Lp, 442
Euclidean, 442
spectral, 444
vector, 442
norm of weights, 412
normal distribution, 61, 456–458
change of variable, 458
distance between, 461
Fréchet distance between, 461
KL divergence between, 461
multivariate, 456
product of two normals, 458
sampling, 459
standard, 456
univariate, 456
Wasserstein distance between, 461
normalization
batch, 192–194
Monte Carlo, 203
batch renormalization, 203
ghost batch, 203
group, 203
in graph neural networks, 265
instance, 203
Kipf, 258, 262
layer, 203
normalizing ows, 303–325
applications, 322
autoregressive, 311–313, 323
coupling, 323
coupling ow, 310–311
coupling functions, 322
elementwise, 309–310, 322
generative direction, 305
GLOW, 318–320, 323
in variational inference, 320
innitesimal, 324
inverse autoregressive, 313
linear, 308–309, 322
masked autoregressive, 312
multi-scale, 316–317
normalizing direction, 305
planar, 322
radial, 322
residual, 313–316, 323
iResNet, 314–316
iRevNet, 313–314
universality, 324
notation, 436–438
nullspace, 443
number set, 436
object detection, 177, 183
feature pyramid network, 183
proposal based, 183
proposal free, 184
R-CNN, 183
YOLO, 177, 184
objective function, see loss function
o-policy method, 384
oine reinforcement learning, 394
on-policy method, 384
one-hot vector, 218, 245
opacity, see transparency
optimization
AdaDelta, 93
AdaGrad, 93
Adam, 93
AdamW, 94
algorithm, 77
AMSGrad, 93
gradient descent, 91
learning rate
warmup, 93
line search, 92
NAdam, 92
Newton method, 92
objective function, 91
RAdam, 93
RMSProp, 93
SGD, 91
stochastic variance reduced descent, 91
YOGI, 93
orthogonal matrix, 446
outer alignment problem, 421
output function, see loss function
over-smoothing, 265
overtting, 22, 125
overparameterization, 404, 414
padding, 164
PairNorm, 265
parameter, 18, 436
parametric ReLU, 38
partial convolution, 181
partially observable MDP, 377
PatchGAN, 291
PCA, 344
perceptron, 37
perceptual loss, 292
performance, 118–137
Performer, 237
permutation invariance, 162, 213, 249
permutation matrix, 245, 446
pinball loss, 73
pipeline model parallelism, 114
pivotal tuning, 301
Pix2Pix, 291
PixelShue, 182
PixelVAE, 344
planar ow, 322
Poisson distribution, 76
policy, 377
behavior, 384
Boltzmann, 399
epsilon-greedy, 384
target, 384
policy gradient method, 388
PPO, 397
REINFORCE, 391
TRPO, 397
policy network, 12
political interference, 427
POMDP, 377
pooling
Lk, 181
average, 181
in graph neural networks, 265
Janossy, 263
max, 181
max-blur, 181
positional encoding, 213, 236
posterior, 451
posterior collapse, 345
PPO, 397
pre-activation, 35
pre-activation residual block, 201
pre-training, 151
transformer encoder, 219
precision, 273
principal agent problem, 421
prior, 139, 140, 451
prior shift, 135
prioritized experience replay, 396
probabilistic generative model, 269
probabilistic PCA, 344
probability, 448–461
Bayes’ rule, 450
conditional, 449
density function, 448
distribution, 437, 448
Bernoulli, 65
categorical, 67
continuous, 448
discrete, 448
distance between, 459–461
mixture of Gaussians, 327
multivariate normal, 456
normal, 456–458
Poisson, 76
sampling from, 459
univariate normal, 61, 456
von Mises, 74
joint, 448
marginalization, 449
notation, 437
random variable, 448
probability ow ODE, 370
progressive growing, 286, 300
protected attribute, 423
proximal policy optimization, 397
pruning, 414
pyramid vision transformer, 238
PyTorch, 106
Q-learning, 384
deep, 385–387
double, 387
double deep, 387
fitted, 384
noisy deep Q-network, 397
quantile loss, 73
quantile regression, 73
query, 210
question answering, 222
R-CNN, 183
Rademacher complexity, 134
radial ow, 322
Rainbow, 397
random synthesizer, 235
random variable, 448
RandomDrop, 202
ranking, 73
recall, 273
receptive eld, 167
in graph neural networks, 254
reconstruction loss, 334
rectied Adam, 93
rectied linear unit, 25
derivative of, 104
dying ReLU problem, 38
non-negative homogeneity, 39
recurrent dropout, 158
recurrent neural network, 233
regression, 2
heteroscedastic, 73
multivariate, 2, 69
quantile, 73
robust, 73
univariate, 61
regularization, 131, 138–160
AdamW, 155
adding noise, 158
adding noise to inputs, 149
adding noise to weights, 149
adversarial training, 149
augmentation, 152–154
bagging, 146
Bayesian approaches, 150
data augmentation, 159
DropConnect, 265
DropEdge, 264
dropout, 147
early stopping, 145, 157
elastic net, 156
ensemble, 145, 157
flooding, 159
Frobenius norm, 140, 155
implicit, 141, 144, 156–157
in graph neural networks, 264
L0, 155
L1, 156
L2, 140
label smoothing, 149, 158
LASSO, 156
multi-task learning, 151
probabilistic interpretation, 139
RandomDrop, 202
ResDrop, 202
ridge regression, 140
self-supervised learning, 159
shake drop, 203
shake-shake, 203
stochastic depth, 202
Tikhonov, 140
transfer learning, 151, 159
weight decay, 140
vs. L2, 155
REINFORCE, 391
reinforcement learning, 373–400
action value, 377
advantage function, 393
baseline, 391
batch, 394
Bellman equations, 379
classical, 396
deadly triad issue, 396
deep dueling network, 397
discount factor, 374
dynamic programming methods, 382
episode, 381, 383
experience replay, 386
exploration-exploitation trade-off, 373
for combinatorial optimization, 396
introduction, 11–12
Markov decision process, 377
Monte Carlo method, 381, 383
natural policy gradients, 397
oine, 394
policy, 377
behavior, 384
Boltzmann, 399
epsilon-greedy, 384
optimal, 378
target, 384
policy gradient method, 388
PPO, 397
REINFORCE, 391
TRPO, 397
policy network, 12
POMDP, 377
prioritized experience replay, 396
Q-learning, 384
deep Q-network, 385–387, 396
double DQN, 387
double Q-learning, 387
fitted, 384
noisy deep Q-network, 397
Rainbow, 397
return, 374
reward, 374
rollout, 381
SARSA, 384
state value, 377
state-action value function, 378
state-value function, 378
tabular, 381–384
temporal dierence method, 384
trajectory, 381
value, 374
with human feedback, 398
relational inductive bias, 248
ReLU, see rectied linear unit
reparameterization trick, 338, 346
representational capacity, 134
ResDrop, 202
residual block, 189
order of operations, 191
residual connection, 189
in graph neural networks, 263, 266
why improves performance, 202
residual ow, 313–316, 323
iResNet, 314–316
iRevNet, 313–314
residual network, 186–206
as ensemble, 202
performance, 198
stable ResNet, 205
unraveling, 189
ResNet v1 & v2, 201
ResNet-200, 195
ResNeXt, 202
resynthesis, 341
return, 374
reverse-mode dierentiation, 116
reward, 374
ridge regression, 140
RL, see reinforcement learning
RMSProp, 93
RNN, 233
robust regression, 73
rollout, 381
rotation equivariance, 182
rotation invariance, 182
run transparency, 425
saddle point, 81, 83, 91
sampling, 459
ancestral, 459
from multivariate normal, 459
SARSA, 384
scalar, 436
scale invariance, 182
scaled dot-product self-attention, 214
scaled exponential linear unit, 113
scientic communication, 432
segmentation, 178
U-Net, 199
self-attention, 208–215
as routing, 235
encoder-decoder, 227
key, 210
masked, 223
matrix representation, 212
multi-head, 214
positional encoding, 213
query, 210
scaled dot-product, 214
value, 208
self-supervised learning, 152, 159
contrastive, 152
generative, 152
SeLU, 113
semantic segmentation, 178, 184
semi-supervised learning, 252
SentencePiece, 234
sentiment analysis, 221
separable convolution, 181
separation, 423
sequence-to-sequence task, 226
sequential model-based conguration, 136
set, 436
SGD, see stochastic gradient descent
shake-drop, 203
shake-shake, 203
shallow neural network, 25–40, 41
shattered gradients, 187
SiLU, 38
singular matrix, 443
skip connection, see residual connection
SkipInit, 205
Slerp, 341
SMAC, 136
snapshot ensembles, 158
softmax, 68
Softplus, 38
sparsity, 155
spatial dropout, 183
spectral norm, 444
spherical covariance matrix, 456
spherical linear interpolation, 341
SQuAD question answering task, 222
squeeze-and-excitation network, 235
SRGAN, 292
Stable Diusion, 369
stable ResNet, 205
stacked hourglass network, 198, 205
stacking, 157
standard deviation, 454
standard normal distribution, 456
standardization, 455
standpoint epistemology, 433
state value, 377
state-action value function, 378
state-value function, 378
Stirling’s formula, 440
stochastic depth, 202
stochastic gradient descent, 83–86, 91, 403
full batch, 85
properties, 85
stochastic variance reduced descent, 91
stochastic weight averaging, 158
stride, 164
structural transparency, 425
structured data, 17
StyleGAN, 295–297, 300
sub-word tokenizer, 218
subspace, 443
super-resolution, 292
supervised learning, 1–7, 17–24
surjection, 439
suspended animation, 265–266
SWATS, 94
SWin transformer, 230, 238
v2, 238
Swish, 38
synchronous data parallelism, 114
synthesizer, 235
random, 235
tabular data, 2, 17
tabular reinforcement learning, 381–384
target policy, 384
teacher forcing, 234
technochauvinism, 433
technological unemployment, 429
temporal dierence method, 381, 384
tensor, 107, 436, 442
tensor model parallelism, 114
TensorFlow, 106
test data, 22
test error
bias, 122
double descent, 127
noise, 122
variance, 122
test set, 118
text classication, 221
text synthesis, 8, 224–225
conditional, 9
text-to-image, 367, 370
TFixup, 237
Tikhonov regularization, 140
tokenization, 218, 234
BPE dropout, 234
Byte pair encoding, 234
SentencePiece, 234
sub-word, 218
WordPiece, 234
top-k sampling, 224
total correlation VAE, 345
trace, 444
Hutchinson estimator, 316
training, 5, 18, 77–117
batch, 85
epoch, 85
error, 19
factors that determine success, 402–406
gradient checkpointing, 114
micro-batching, 114
reducing memory requirements, 114
stochastic gradient descent, 83–86
tractability, 401
trajectory, 381
transductive model, 252
transfer learning, 151, 159, 219
fine-tuning, 151
pre-training, 151
transformer, 207–239
applications, 233
applied to images, 228–232
BERT, 219–222
BigBird, 237
CLIP, 238
combined with CNNs, 238
combining images and text, 238
cross covariance image transformer, 238
Crossformer, 238
DaViT, 232, 238
decoding algorithm, 234
denition, 215
encoder model, 219–222
encoder-decoder model, 226
extended construction, 237
for NLP, 216
for video processing, 238
for vision, 238
ImageGPT, 229
LinFormer, 237
long sequences, 227
multi-head self-attention, 214
multi-scale, 238
multi-scale vision, 230
Performer, 237
positional encoding, 213, 236
pyramid vision, 238
scaled dot-product attention, 214
SWin, 230, 238
SWin V2, 238
synthesizer, 235
TFixup, 237
ViT, 229
translation (automatic), 226
translation equivariance, 182
translation invariance, 182
transparency, 424–425
functional, 425
run, 425
structural, 425
transport plan, 283
transpose, 442
transposed convolution, 172, 181
Tree-Parzen estimators, 136
triangular matrix, 445
TRPO, 397
truncation trick, 288
trust region policy optimization, 397
U-Net, 197, 205
++, 205
3D, 205
segmentation results, 199
undertting, 22
undirected edge, 241
univariate normal, 456
univariate regression, 61
universal approximation theorem, 29
depth, 53
width, 38
universality
normalizing ows, 324
unpooling, 181
unraveling, 189
unsupervised learning, 7–11, 268–274
model taxonomy, 268
upper triangular matrix, 445
upsample, 171
V-Net, 205
VAE, see variational autoencoder
valid convolution, 164
value, 208, 374
value alignment, 420–426
inner alignment problem, 421
outer alignment problem, 421
principal agent problem, 421
value of action, 377
value of state, 377
value-free ideal of science, 431
vanishing gradient problem, 108
in residual networks, 192
Vapnik-Chervonenkis dimension, 134
variable, 436
variance, 122, 454
identity, 454
variational approximation, 335
with normalizing ows, 320
variational autoencoder, 326–347
adversarially learned inference, 345
aggregated posterior, 340, 341
applications, 343
beta VAE, 342, 345
combination with other models, 345
conditional, 344
decoder, 337
disentanglement, 341
encoder, 337
estimating probability, 339
factor VAE, 346
generation, 340
hierarchical model for posterior, 344
holes in latent space, 345
information preference problem, 345
modifying likelihood term, 344
normalizing ows, 344
PixelVAE, 344
posterior collapse, 345
relation to EM algorithm, 346
relation to other models, 344
reparameterization trick, 338, 346
resynthesis, 341
total correlation VAE, 345
VC dimension, 134
vector, 436, 442
dot product, 443
norm, 442
VEEGAN, 300
VGG, 176
VGG loss, 292
vision transformer
DaViT, 232
ImageGPT, 229
multi-scale, 230
SWin, 230
ViT, 229
visualizing activations, 184
ViT, 229
von Mises distribution, 74
Wasserstein distance, 282
between normals, 461
Wasserstein GAN, 280–285, 299
weaponization of AI, 13
weight, 36
decay, 140, 155
initialization, 107–111
matrix, 49
Weisfeiler-Lehman graph isomorphism test, 264
WGAN-GP, 285
wide minima, 158
width eciency, 53
width of neural network, 46
word classication, 221
word embedding, 218
WordPiece, 234
Xavier initialization, 113
YOGI, 93
YOLO, 177, 184
zero padding, 164